METHOD AND APPARATUS FOR EXTRACTING AND
EVALUATING MUTUALLY SIMILAR PORTIONS IN
ONE-DIMENSIONAL SEQUENCES IN MOLECULES AND/OR
THREE-DIMENSIONAL STRUCTURES OF MOLECULES

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to a method and apparatus
for extracting and evaluating mutually coinciding or similar
portions between sequences of atoms or atomic groups in
molecules and/or between three-dimensional structures of
molecules and, particularly, to a method and apparatus for
automatically extracting and evaluating mutually coinciding
or similar portions between amino acid sequences in protein
molecules and/or between three-dimensional structures of
protein molecules.

2. Description of the Related Art 

A gene is in substance DNA, and is expressed as a
base sequence including four bases: A (adenine), T
(thymine), C (cytosine), and G (guanine). There are about
twenty types of amino acids constituting an organism, and it
has been shown that arrangements of three bases correspond
to the respective amino acids. Accordingly, it has been
found that the amino acids are synthesized according to
the base sequences of the DNA in the organism and that a
protein is formed by folding the synthesized amino acids.
The arrangement of amino acids is expressed as an amino acid
sequence in which the respective amino acids are expressed
in letters, similarly to the base sequence.

A method for determining sequences of bases and
amino acids has been established together with the
development of molecular biology, and therefore a huge
amount of gene information including base sequence
data and amino acid sequence data has been stored.
Thus, in the field of gene information processing, a
core subject has been how to extract biological
information concerning the structure and function of
proteins out of the huge amount of stored gene
information.

A basic technique in extracting the
biological information is to compare sequences.
This is because similar sequences are considered to
have similar biological functions. Accordingly, two
techniques are presently studied: a homology search,
which estimates the function of an unknown sequence by
searching a data base of sequences whose functions are
known for a sequence similar to the unknown sequence;
and an alignment, in which sequences are rearranged so
as to maximize the degree of analogy between the
compared sequences.

Further, it is considered that a region of
the sequence in which a function important for the
organism is coded is conserved in the evolution
process. For instance, a commonly existing sequence
pattern (region) is known to be found when the amino
acid sequences of proteins having the same function
are compared between different types of organisms.
This region is called a motif. Accordingly, if the
motif can be extracted automatically, the property and
function of a protein can be shown by finding which
motif is included in its sequence. Further, the
automatic motif extraction is applicable to a variety
of protein engineering fields such as strengthening of
the properties of preexisting proteins, addition of
functions to preexisting proteins, and synthesis of
new proteins. As described above, extracting the motif
out of the amino acid sequence can be considered an
effective means of extracting the biological
information. However, no extracting method is yet
established, and researchers currently decide manually
which part is a motif sequence after the homology
search and alignment.

A dynamic programming technique of the kind used
in voice recognition processing has been the only
method used for automatically comparing two amino acid
sequences.

However, according to the method of comparing
the amino acid sequences using the dynamic programming
technique, the amino acid sequences are compared
two-dimensionally. Thus, this method requires a large
memory capacity and a long processing time.
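As a point of reference, the two-dimensional comparison described above can be sketched as the textbook dynamic programming table for the longest common subsequence. This is a generic illustration and not code from the specification; the function name `lcs_length_dp` is hypothetical. The table holds (m + 1) by (n + 1) entries, which is the memory cost the paragraph refers to.

```python
def lcs_length_dp(x, y):
    """Length of the longest common subsequence via a full 2-D table."""
    m, n = len(x), len(y)
    # table[i][j] = LCS length of x[:i] and y[:j]; (m + 1) * (n + 1) cells.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


if __name__ == "__main__":
    print(lcs_length_dp("ABCBDAB", "BDCABA"))
```

Every cell is visited once, so both time and memory grow with the product of the two sequence lengths.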

Meanwhile, in the fields of physics and
chemistry, in order to examine the properties of a new
(unknown) substance and to produce the new substance
artificially, three-dimensional structures of
substances are determined by a technique such as X-ray
crystal analysis or NMR analysis, and information on
the determined three-dimensional structures is stored
in a data base. As a typical data base, the PDB
(Protein Data Bank), in which three-dimensional
structures of proteins or the like identified by X-ray
crystal analysis are registered, is widely known and
universally used. Further, the CSD (Cambridge
Structural Database) is known as a data base in which
chemical substances are registered.

In a protein, a plurality of amino acids
are linked to one another as a single chain, and this
chain is folded in an organism to form a
three-dimensional structure. In this way, the protein
exhibits a variety of functions. The respective amino
acids are identified by numbering them from the
N-terminal through the C-terminal. These numbers are
called amino acid numbers, amino acid sequence
numbers, or amino acid residue numbers. Each amino
acid includes a plurality of atoms according to the type
thereof. Therefore, the PDB registers the names and
administration numbers of proteins, the amino acid numbers
constituting each protein, the types and three-dimensional
coordinates of the atoms constituting the respective amino
acids, and the like.

It is known from the results of chemical studies conducted
thus far that the three-dimensional structure of a substance
is closely related to the function thereof, and the
relationship between the three-dimensional structure and the
function is examined through chemical experiments in order to
modify a substance or to produce a substance having a new
function. Particularly, since a structurally similar
portion (or a specific portion) between substances
having the same function is considered to influence the
function of the substance, it is essential to discover a
similar structure commonly existing in the three-dimensional
structures.

However, since there is no method of extracting a
characteristic portion directly from the three-dimensional
coordinates, the researchers are at present compelled to
display the respective three-dimensional structures on a
three-dimensional graphic system and to search for the
characteristic portion manually. Because there is in general
no method of determining a reference orientation of the
substance, this search requires a substantial amount of time.

When the researcher searches for a similar
three-dimensional structure, an r.m.s.d. (root mean square
distance) value is used as a scale of the similarity of the
three-dimensional structures of the substances. The
r.m.s.d. value is the square root of the mean square
distance between the corresponding elements constituting the
substances. Empirically, the substances are thought to be
exceedingly similar to each other in the case where
the r.m.s.d. value between the substances is not
greater than 1 Å.

For instance, it is assumed that there are
substances expressed by a point set A = {a_1, a_2, ...,
a_i, ..., a_m} and a point set B = {b_1, b_2, ..., b_j,
..., b_n}, wherein a_i (i = 1, 2, ..., m) and b_j
(j = 1, 2, ..., n) are vectors expressing positions of the
respective elements in the three-dimensional space.
The elements constituting these substances A and B are
related to each other, and the substance B is rotated
and moved so that the r.m.s.d. value between the
corresponding elements is minimized. For example, if
a_k is related to b_k (k = 1, 2, ..., n), the r.m.s.d.
value is obtained by the following equation (1),
wherein U denotes a rotation matrix and w_k denote
respective weights:

$$\text{r.m.s.d.} = \left( \frac{\sum_{k=1}^{n} w_k \, (U b_k - a_k)^2}{n} \right)^{1/2} \qquad (1)$$

A technique of obtaining the rotation and movement of
the substances that minimizes the r.m.s.d. value
between these corresponding points was proposed by
Kabsch (for example, refer to "A Solution for
the Best Rotation to Relate Two Sets of Vectors," by
W. Kabsch, Acta Cryst. (1976), A32, 923), and is
presently widely used. However, since this method
compares the same number of points whose correspondence
is already given, the researchers are presently
studying, by trial and error, which combinations of
elements should be related between the substances so
as to obtain the minimum r.m.s.d. value.
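To make equation (1) concrete, the following minimal sketch evaluates the weighted r.m.s.d. for a correspondence and a rotation matrix that are already given. It does not search for the optimal rotation, and it assumes any movement (translation) has already been applied to the points. The names `rotate` and `weighted_rmsd` are hypothetical and not taken from the specification.

```python
import math


def rotate(u, v):
    """Apply a 3x3 rotation matrix u (list of rows) to a 3-vector v."""
    return tuple(sum(u[row][col] * v[col] for col in range(3)) for row in range(3))


def weighted_rmsd(a, b, u, w=None):
    """Square root of sum_k w_k * |U b_k - a_k|^2 / n for paired points."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("point lists must be non-empty and of equal length")
    if w is None:
        w = [1.0] * n  # unit weights when none are supplied
    total = 0.0
    for a_k, b_k, w_k in zip(a, b, w):
        ub = rotate(u, b_k)
        total += w_k * sum((x - y) ** 2 for x, y in zip(ub, a_k))
    return math.sqrt(total / n)


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    print(weighted_rmsd([(0, 0, 0)], [(3, 4, 0)], identity))
```

Because the pairing a_k to b_k is an input here, the sketch also illustrates the limitation noted above: choosing which elements to pair is left to the caller.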

Further, it is necessary to study the
preexisting substances in order to produce a new
substance. For instance, in the case where the heat
resistance of a certain substance is to be
strengthened, a structure commonly existing among
strongly heat-resistant substances is determined, and
such a structure is added to a newly produced
substance to thereby strengthen the function of the
substance. To this end, a function of retrieving the
necessary structure from the data base is required.
However, the researchers are presently searching the
data base for the necessary structure, by trial
and error, using the computer graphic system for the
aforementioned reasons.

As described above, the operators are
compelled to graphically display the three-dimensional
structure of the substance they want to analyze using
the graphic system, and to analyze it by visual
comparison with other molecules on a screen,
superposition, and like operations.

Meanwhile, basic structures such as an α
helix and a β strand are commonly found in the
three-dimensional structure of protein, and they are called
secondary structures. Methods of carrying out an
automatic search by a similarity of the secondary
structure without using the r.m.s.d. value have been
considered. According to these methods, a partial
structure is expressed by symbols of the secondary
structures along the amino acid sequence and the
comparison is made using these symbols. Therefore,
the comparison cannot be made according to a
similarity of the spatial positional relationship of
the partial structures.

As mentioned above, in the case where the
three-dimensional structure of the substance is analyzed
using the CSD and PDB, a great amount of time and
labor is required to manually search a huge amount of
data for a structure and to compare the retrieved
structure with the three-dimensional structure to be
analyzed, thereby imposing a heavy burden on the operators.
As a result, the data included in the data base cannot be
utilized effectively, thus presenting the problem that the
structure of the substance cannot be analyzed sufficiently.
Accordingly, there has been a need for a retrieval system
that retrieves structures from a three-dimensional
structure data base based on the analogy of the
three-dimensional structures.

SUMMARY OF THE INVENTION 

An object of the invention is to provide a method and
apparatus capable of automatically extracting and evaluating
mutually coinciding or similar portions between sequences of
atoms or atomic groups in molecules such as protein
molecules in accordance with a simple processing mechanism.

Another object of the invention is to provide a method
and apparatus capable of automatically extracting and
evaluating mutually coinciding or similar portions between
three-dimensional structures of molecules such as
protein molecules.

In accordance with the present invention there is
provided a method of analyzing sequences of atomic groups
including a first sequence having m atomic groups and a
second sequence having n atomic groups where m and n are
integers, comprising the steps of:

a) preparing an array S[i] having array elements S[0]
to S[m];

b) initializing all array elements of the array S[i]
to zero and initializing an integer j to 1;

c) adding 1 to each array element S[i] that is
equal to an array element S[r] and that satisfies i ≥ r if
the array element S[r] is equal to an array element S[r-1],
where r is an occurrence position of the j-th atomic group
of the second sequence in the first sequence;

d) adding 1 to the integer j;

e) repeating the steps c) and d) until the
integer j exceeds n; and

f) obtaining a longest common atomic group
number between the first and the second sequences from
a value of the array element S[m].
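The steps a) to f) above can be sketched as follows, treating each atomic group as one character. This is a minimal illustration under stated assumptions, not the specification's own implementation: the name `lcs_length` and the dictionary of occurrence positions are hypothetical, the comparison in step c) is read as i ≥ r, and when a character occurs several times the positions r are visited in decreasing order.

```python
def lcs_length(first, second):
    """Longest common atomic group number using the one-dimensional array S."""
    m = len(first)
    # 1-based occurrence positions of every symbol of the first sequence.
    occurrences = {}
    for position, symbol in enumerate(first, start=1):
        occurrences.setdefault(symbol, []).append(position)

    s = [0] * (m + 1)  # steps a) and b): S[0] .. S[m], all zero

    for symbol in second:  # steps d) and e): j runs from 1 to n
        # Assumption: several occurrences are handled from the largest r down.
        for r in reversed(occurrences.get(symbol, [])):
            if s[r] == s[r - 1]:  # step c) condition
                old = s[r]
                for i in range(r, m + 1):  # every i >= r
                    if s[i] == old:
                        s[i] += 1
    return s[m]  # step f)


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

Only an array of m + 1 integers is kept, in contrast to a full two-dimensional comparison table.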

It is preferable that the method further comprises
the steps of:

g) preparing an array data[k] having array
elements data[0], data[1], ...;

h) storing paired data (r, j) in an array
element data[k] if the array element S[i] is changed
in the step c), where k = S[r];

i) linking the paired data (r, j) stored in the
step h) to paired data (r', j') if r' < r and j' < j,
where the paired data (r', j') is one stored in an
array element data[k-1]; and

j) obtaining a longest common subsequence
between the first and the second sequences and
occurrence positions of the longest common subsequence
in the first and the second sequences by tracing the
link formed in the step i).
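One way the steps g) to j) above might be realized is sketched below. It is an illustration under assumptions rather than the specification's data structure: each stored pair is a tuple `(r, j, parent)`, the link of step i) is the `parent` reference to the first suitable pair found in data[k-1], and the names `lcs_with_positions` and `data` layout are hypothetical. Step c) is read as i ≥ r with occurrences visited in decreasing order of r.

```python
def lcs_with_positions(first, second):
    """Return one longest common subsequence and its (r, j) positions (1-based)."""
    m = len(first)
    occurrences = {}
    for position, symbol in enumerate(first, start=1):
        occurrences.setdefault(symbol, []).append(position)

    s = [0] * (m + 1)
    data = [[] for _ in range(m + 2)]  # step g): data[k] holds pairs of level k

    for j, symbol in enumerate(second, start=1):
        for r in reversed(occurrences.get(symbol, [])):
            if s[r] != s[r - 1]:
                continue
            old = s[r]
            for i in range(r, m + 1):
                if s[i] == old:
                    s[i] += 1
            k = s[r]  # step h): level of the new pair
            parent = None
            for candidate in data[k - 1]:  # step i): link to an earlier pair
                if candidate[0] < r and candidate[1] < j:
                    parent = candidate
                    break
            data[k].append((r, j, parent))

    k_max = s[m]
    if k_max == 0:
        return "", []
    # Step j): trace the links back from any pair stored at the top level.
    node = data[k_max][0]
    positions = []
    while node is not None:
        positions.append((node[0], node[1]))
        node = node[2]
    positions.reverse()
    subsequence = "".join(first[r - 1] for r, _ in positions)
    return subsequence, positions


if __name__ == "__main__":
    print(lcs_with_positions("ABCBDAB", "BDCABA"))
```

The sketch returns a single longest common subsequence; tracing other pairs stored in data[k_max] would yield alternatives of the same length.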

In accordance with the present invention there is
also provided a method of analyzing three-dimensional
structures including a first structure expressed by
three-dimensional coordinates of elements belonging to
a first point set and a second structure expressed by
three-dimensional coordinates of elements belonging to
a second point set, comprising the steps of:

a) generating a combination of correspondence
satisfying a restriction condition between the
elements belonging to the first point set and the
elements belonging to the second point set from among
all candidates for the combination of correspondence;
and

b) calculating a root mean square distance
between the elements corresponding in the combination
of correspondence generated in the step a).
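Step a) above can be pictured as a depth-first enumeration that discards candidates violating a restriction condition. The sketch below is only an illustration with assumptions that are not stated in this passage: both point sets are ordered, the correspondence must preserve that order, every element of the first set is matched, and the restriction is that distances between matched points agree within a tolerance `tol`. The function name `correspondences` is hypothetical, and the r.m.s.d. calculation of step b) is not included.

```python
import math


def correspondences(set_a, set_b, tol=0.5):
    """Yield order-preserving correspondences as lists of (i, j) index pairs."""
    m, n = len(set_a), len(set_b)

    def extend(partial, next_b):
        i = len(partial)  # index of the next element of A to be matched
        if i == m:
            yield list(partial)
            return
        for j in range(next_b, n):
            # Assumed restriction: distances to all already matched pairs
            # must agree between the two structures within tol.
            consistent = all(
                abs(math.dist(set_a[i], set_a[pi]) - math.dist(set_b[j], set_b[pj])) <= tol
                for pi, pj in partial
            )
            if consistent:
                partial.append((i, j))
                yield from extend(partial, j + 1)
                partial.pop()

    yield from extend([], 0)


if __name__ == "__main__":
    a = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
    b = [(5, 5, 5), (0, 0, 1), (1, 0, 1), (9, 9, 9), (1, 1, 1)]
    print(list(correspondences(a, b, tol=0.1)))
```

Pruning at each level keeps the search from visiting every candidate combination, which is the practical point of applying a restriction condition before any r.m.s.d. is computed.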

In accordance with the present invention there is
also provided a method of analyzing three-dimensional
structures including a first structure expressed by
three-dimensional coordinates of elements belonging to
a first point set and a second structure expressed by
three-dimensional coordinates of elements belonging to
a second point set, comprising the steps of:

a) dividing the second point set into a
plurality of subsets having a size that is determined
by the size of the first point set;

b) generating a combination of correspondence
satisfying a restriction condition between the
elements belonging to the first point set and the
elements belonging to each of the subsets of the
second point set from among all candidates for the
combination of correspondence; and

c) calculating a root mean square distance
between the elements corresponding in the combination
of correspondence generated in the step b).
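As a rough picture of step a) above, the second point set can be cut into overlapping windows whose length follows the number of elements in the first point set. How the subsets are actually formed is not specified in this passage, so the sliding window, the `slack` margin, and the name `divide_into_subsets` are all assumptions made for illustration.

```python
def divide_into_subsets(points_b, size_a, slack=0, step=1):
    """Split an ordered point list into windows of length size_a + slack."""
    points_b = list(points_b)
    window = size_a + slack
    if window >= len(points_b):
        return [points_b]  # the whole set is the only subset
    return [points_b[start:start + window]
            for start in range(0, len(points_b) - window + 1, step)]


if __name__ == "__main__":
    print(divide_into_subsets(range(6), 3))
```

Each subset can then be compared with the first point set separately, so the number of candidate correspondences examined at once stays small.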

In accordance with the present invention there is
also provided a method of analyzing three-dimensional
structures including a first structure expressed by
three-dimensional coordinates of elements belonging to
a first point set and a second structure expressed by
three-dimensional coordinates of elements belonging to
a second point set, comprising the steps of:

a) dividing the first point set and the second
point set into first subsets and second subsets,
respectively, according to a secondary structure
exhibited by the three-dimensional coordinates of the
elements of the first and the second point sets;

b) generating a combination of correspondence
satisfying a first restriction condition between the
first subsets and the second subsets from among
candidates for the combination of correspondence;

c) determining an optimum correspondence between
the elements belonging to each pair of subsets
corresponding in the combination of correspondence
generated in the step b); and

d) calculating a root mean square distance
between all of the elements corresponding in the
optimum correspondence determined in the step c).

In accordance with the present invention there is
also provided an apparatus for analyzing sequences of
atomic groups including a first sequence having
m atomic groups and a second sequence having n atomic
groups where m and n are integers, comprising:

means for preparing an array S[i] having
array elements S[0] to S[m];

means for initializing all array elements of
the array S[i] to zero and initializing an
integer j to 1;

means for renewing the array S[i] by adding 1
to each array element S[i] that is equal to an array
element S[r] and that satisfies i ≥ r if the array
element S[r] is equal to an array element S[r-1], where
r is an occurrence position of the j-th atomic group of
the second sequence in the first sequence;

means for incrementing the integer j by 1;

means for repeatedly activating the renewing
means and the incrementing means until the integer j
exceeds n; and

means for obtaining a longest common atomic
group number between the first and the second
sequences from a value of the array element S[m].

It is preferable that the apparatus further
comprises:

means for preparing an array data[k] having
array elements data[0], data[1], ...;

means for storing paired data (r, j) in an
array element data[k] if the array element S[i] is
changed by the renewing means, where k = S[r];

means for linking the paired data (r, j)
stored by the storing means to paired data (r', j') if
r' < r and j' < j, where the paired data (r', j') is
one stored in an array element data[k-1]; and

means for obtaining a longest common
subsequence between the first and the second sequences
and occurrence positions of the longest common
subsequence in the first and the second sequences by
tracing the link formed by the linking means.

In accordance with the present invention there is 
provided an apparatus for analyzing three-dimensional 
structures including a first structure expressed by 
three-dimensional coordinates of elements belonging to 
a first point set and a second structure expressed by 
three-dimensional coordinates of elements belonging to 
a second point set, comprising: 

means for generating a combination of 
correspondence satisfying a restriction condition 
between the elements belonging to the first point set 
and the elements belonging to the second point set 
from among all candidates for the combination of 
correspondence; and 

means for calculating a root mean square 
distance between the elements corresponding in the 
combination of correspondence generated by the 
generating means . 

In accordance with the present invention there is
provided an apparatus for analyzing three-dimensional
structures including a first structure expressed by
three-dimensional coordinates of elements belonging to
a first point set and a second structure expressed by
three-dimensional coordinates of elements belonging to
a second point set, comprising:

means for dividing the second point set into
a plurality of subsets having a size that is
determined by the size of the first point set;

means for generating a combination of
correspondence satisfying a restriction condition
between the elements belonging to the first point set
and the elements belonging to each of the subsets of
the second point set from among all candidates for the
combination of correspondence; and

means for calculating a root mean square
distance between the elements corresponding in the
combination of correspondence generated by the
generating means.

In accordance with the present invention there is
also provided an apparatus for analyzing
three-dimensional structures including a first structure
expressed by three-dimensional coordinates of elements
belonging to a first point set and a second structure
expressed by three-dimensional coordinates of elements
belonging to a second point set, comprising:

means for dividing the first point set and
the second point set into first subsets and second
subsets, respectively, according to a secondary
structure exhibited by the three-dimensional
coordinates of the elements of the first and the
second point sets;

means for generating a combination of
correspondence satisfying a first restriction condition
between the first subsets and the second subsets from
among candidates for the combination of
correspondence;

means for determining an optimum
correspondence between the elements belonging to each
pair of subsets corresponding in the combination of
correspondence generated by the generating means; and

means for calculating a root mean square
distance between all of the elements corresponding in
the optimum correspondence.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a [...] of a gene information survey
apparatus according to an embodiment of the present
invention;

Figure 2 is a flowchart showing a process for
detecting a longest common character number in an LCS
detection unit of Fig. 1;

Figures 3 and 4 are flowcharts showing a process
for detecting an LCS and occurrence positions thereof
[...];

Figure 5 is a diagram showing a table of occurrence
positions generated in the LCS detection unit;

Figure 6 is a diagram explaining an example of the
operation of the LCS detection unit;

Figure 7 is a diagram showing a linked data
structure generated in the LCS detection unit;

Figure 8 is a flowchart showing the linked data
structure tracing operation;

Figure 9 is a flowchart showing an operation of a
retrieval process called in the tracing operation;

Figure 10 is a diagram showing an example of
output results of the gene information survey
apparatus;

Figure 11 is a diagram showing another example of
[...];

[...] determination of correspondence of partial
three-[...] of two nonordered point sets; [...]
algorithm for [...] sets; [...] generating a
combination of correspondence between two ordered
point sets;

Figure 18 is a diagram showing a tree structure 
expressing candidates for a combination of 
correspondence between elements of two ordered point
sets that are partially related to each other; 

Figures 19A and 19B are diagrams explaining the
refining of candidates using a distance relationship; 

Figures 20A and 20B are diagrams explaining 
refining of candidates using an angle relationship;

Figure 21 is a diagram showing a tree structure 
explaining the refining of candidates using a 
restriction condition of the number of nil elements; 

Figure 22 is a block diagram showing a 
construction of a molecular structure display device
according to another embodiment of the present 
invention; 

Figures 23A and 23B are diagrams showing amino 
acid sequences of calmodulin and troponin C, 
respectively;

Figures 24A and 24B are diagrams showing three- 
dimensional structures of calmodulin and troponin C, 
respectively;

Figure 25 is a diagram showing an example of
output results of the device of Fig. 22;

Figure 26 is a diagram showing another example of 
output results of the device of Fig. 22; 

Figure 27 is a block diagram of a construction of 
a three-dimensional structure retrieval device 
according to another embodiment of the present
invention; 

Figure 28 is a diagram showing a construction of a 
function data base generating device according to 
another embodiment of the present invention; 

Figure 29 is a diagram showing an example of
output results of the device of Fig. 27;

Figure 30 is a diagram showing the retrieval 
results as three-dimensional structures;

Figure 31 is a block diagram showing a 
construction of a function predicting device according 
to another embodiment of the present invention; 

Figures 32A and 32B are diagrams showing linear
structures and non-linear structures, respectively; 

Figure 33 is a diagram explaining the division of 
a point set B into subsets according to the number of 
elements belonging to a point set A; 

Figure 34 is a flowchart showing a process for 
dividing a point set B into subsets according to the 
number of elements belonging to a point set A; 

Figures 35A and 35B are diagrams explaining the 
division of a point set B into subsets according to a 
spatial size of a point set A;

Figure 36 is a flowchart showing an example of a 
process for dividing a point set B into subsets 
according to a spatial size of a point set A; 

Figure 37 is a flowchart showing another example 
of the process for dividing a point set B into subsets 
according to a spatial size of a point set A; 

Figures 38A and 38B are diagrams showing amino 
acid sequences of trypsin and elastase, respectively; 

Figures 39A and 39B are diagrams showing retrieval 
results of three-dimensional structures; 

Figure 40 is a diagram showing a tree structure 
expressing candidates for a combination of 
correspondence between subsets; 

Figure 41 is a flowchart showing a process of 
determining correspondence between subsets;

Figure 42 is a block diagram showing a
construction of a retrieval process device according to
another embodiment of the present invention;

Figure 43 is a flowchart showing a process of 
dividing a point set into subsets according to 
secondary structures; 

Figure 44 is a diagram showing the results of the 
division of a point set into subsets according to secondary
structures;

Figure 45 is a flowchart showing a process for
retrieving proteins using a method of dividing into subsets
according to secondary structures;

Figure 46 is a diagram showing an output result of a
similar structure retrieval using a protein as a retrieval
key; and

Figures 47A and 47B are diagrams showing a protein
having a similar structure retrieved by a key protein.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Analysis of sequences

Figure 1 shows a gene information survey apparatus 1
according to an embodiment of the invention. In Fig. 1, [...]
42 denotes a display device connected to the gene
information survey apparatus 1 [...].

The gene information survey apparatus 1 of this
embodiment includes an LCS detection unit 30, a homology
decision unit 31, a homology search unit 32, a motif search
unit 33, an alignment unit 34, and a display [...].

The LCS detection unit 30 determines an LCS (longest
common subsequence) between a character sequence expressing
an amino acid sequence input from the input device 40 and a
character sequence expressing an amino acid sequence [...]
data base 50. The LCS is the longest of the subsequences
that commonly occur, continuously or intermittently, in
both character sequences, and the longest common character
number is the number of characters constituting the LCS.
The homology decision unit 31 [...] the detection result of
the LCS detection unit 30. The homology search unit 32
searches the [...] amino acid sequence input from the input
device 40 based on the decision result of the homology
decision unit 31. The motif search unit 33 searches the
motif data base 60 for a motif sequence similar to the amino
acid sequence input from the input device 40 [...] the
detection result of the LCS detection unit 30. The
alignment unit 34 aligns the character sequence of the amino
acid sequence input from the input device 40 and the
character sequence of the amino acid sequence [...] from the
amino acid sequence data base 50 or motif data base 60.
[...] the respective processing units in the display [...].

A processing carried out by the LCS detection unit 30
[...] is carried out to detect the length of LCS between
the two amino acid sequences [...], or to detect the
longest common subsequence LCS between the two amino acid
sequences to be surveyed and the occurrence position
thereof.

In detecting the length of LCS between the amino acid
sequences expressed by a character sequence I and a
character sequence II, the LCS detection unit 30 reads the
characters individually from the character sequence I and
generates an occurrence table indicative of the occurrence
positions of the respective characters in the character
sequence I in Step 1, as shown in the processing flow of
Fig. 2.

This occurrence table is generated, for example, by
linking array elements P[1] to P[26] corresponding to
the letters A to Z with data of the occurrence positions of
the respective characters by pointers 62, as shown in Fig.
5. For instance, in the case where the amino acid sequence
of the character sequence I is expressed as "ABCBDAB", the
occurrence table is generated such that "A" occurs in the
sixth and first places; "B" occurs in the seventh, fourth,
and second places; "C" occurs in the third place; and "D"
occurs in the fifth place. In Step 1, an array S[i] having
the same size as the character sequence I, which is used in
the subsequent processing, is initialized and a zero value
is set in each entry.
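The occurrence table described above can be pictured with a small sketch. A dictionary of lists stands in for the pointer-linked array elements P[1] to P[26]; that substitution and the name `build_occurrence_table` are assumptions made for illustration. Positions are 1-based and stored with the largest position first, matching the order in which the example lists them.

```python
def build_occurrence_table(sequence):
    """Map each character to its 1-based positions, largest position first."""
    table = {}
    for position, symbol in enumerate(sequence, start=1):
        # Inserting at the front keeps the list in decreasing order.
        table.setdefault(symbol, []).insert(0, position)
    return table


if __name__ == "__main__":
    for symbol, positions in sorted(build_occurrence_table("ABCBDAB").items()):
        print(symbol, positions)
```

For "ABCBDAB" this gives A at positions 6 and 1, B at 7, 4, and 2, C at 3, and D at 5, as stated in the text.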

In Step 2, the characters are successively read from
the character sequence II and the occurrence positions r of
these characters in the character sequence I are specified
with reference to the occurrence table generated in Step 1.
Subsequently, in Step 3, it is determined whether an entry
data of S[r], which is in the r-th place of the array S[i],
is equal to an entry data of S[r-1], which is in the (r-1)th
place thereof.

If it is determined that S[r] = S[r-1] in Step 3, Step
4 follows in which "1" is added to each S[i] where i ≥ r and
whose entry data is equal to that of S[r-1]. Subsequently,
in Step 5, it is determined whether the processing has been
completed up to the last character of the character sequence
II. If the determination
result is in the negative in Step 5, this routine returns to
Step 2. On the other hand, if it is determined that S[r] ≠
S[r-1] in Step 3, this routine proceeds to Step 5 without
executing the additional processing in Step 4. In the case
where the character of the character sequence II read in
Step 2 occurs in the character sequence I a plurality of
times, Steps 3 and 4 are repeated in decreasing order of the
occurrence positions r. If it is determined that the
processing has been completed up to the last character of
the character sequence II, this routine proceeds to Step 6,
in which the entry data Kmax of the last element S[m] of the
array S[i] is output as the length of LCS.

Describing the above processing specifically, in
the case where the amino acid sequence of the character
sequence I is expressed as "ABCBDAB" and that of the
character sequence II is expressed as "BDCABA", "r = 7, 4,
2" is specified from the list linked to P[2] out of the
occurrence table shown in Fig. 5 in accordance with the
reading of the first character B (j = 1) of the character
sequence II, and the entry data of the array S[i] is renewed
sequentially as shown in Fig. 6. Then, "r = 5" is specified
from the occurrence table shown in Fig. 5 in accordance with
the reading of the second character D (j = 2) of the
character sequence II, and the entry data of the array S[i]
is renewed as shown in Fig. 6. Then, "r = 3" is specified
in accordance with the reading of the third character C
(j = 3) of the character sequence II, and the entry data of
the array S[i] is renewed as shown in Fig. 6. Then,
"r = 6, 1" is specified in accordance with the reading of
the fourth character A (j = 4) of the character sequence II,
and the entry data of the array S[i] is renewed as shown in
Fig. 6.
The entry data of the array S[i] set in this manner give
the length of LCS between the character subsequence
consisting of the first to j-th characters of the character
sequence II and the character subsequence consisting of the
first to i-th characters of the character sequence I after
the j-th character of the character sequence II has been
processed.

Thereafter, "r = 7, 4, 2" is specified from the
occurrence table in accordance with the reading of the fifth
character B (j = 5) of the character sequence II, and the
entry data of the array S[i] is renewed as shown in Fig. 6.
Then, "r = 6, 1" is specified from the occurrence table
shown in Fig. 5 in accordance with the reading of the sixth
character A (j = 6) of the character sequence II, and the
entry data of the array S[i] is renewed as shown in Fig. 6.
It should be noted that the array S[i] additionally has an
element S[0] and therefore has a size that is larger by one
than the length of the character sequence I.
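The worked example can be replayed with a short sketch that records the array S[i] after each character of the character sequence II. The snapshots come from running this sketch and are not copied from Fig. 6; the name `trace_s` is hypothetical, and the update is read as applying to every i ≥ r with occurrences visited in decreasing order of r.

```python
def trace_s(seq1, seq2):
    """Return the contents of S[0..m] after each character of seq2 is processed."""
    m = len(seq1)
    table = {}
    for position, symbol in enumerate(seq1, start=1):
        table.setdefault(symbol, []).insert(0, position)  # decreasing order of r

    s = [0] * (m + 1)
    snapshots = []
    for symbol in seq2:
        for r in table.get(symbol, []):
            if s[r] == s[r - 1]:
                old = s[r]
                for i in range(r, m + 1):
                    if s[i] == old:
                        s[i] += 1
        snapshots.append(list(s))  # copy, so later updates do not alter it
    return snapshots


if __name__ == "__main__":
    for j, row in enumerate(trace_s("ABCBDAB", "BDCABA"), start=1):
        print(j, row)
```

After the sixth character the last element S[7] holds 4, the length of LCS for this pair of sequences, and each intermediate S[i] equals the length of LCS between the first i characters of sequence I and the characters of sequence II read so far.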

The processing to determine the longest common
subsequence between the two character sequences
surveyed and the occurrence position thereof will be
described next. The LCS detection unit 30 reads
characters from the character sequence I and generates an
occurrence table indicative of the occurrence positions of
the respective characters in the character sequence I in
Step 10 as shown in the processing flow of Fig. 3, in
detecting the longest common subsequence between the
character sequences I and II and the occurrence position
thereof. In other words, the same table as the
occurrence table described above is
generated. In Step 10, an array S[i] having the same
size as the character sequence I, which is used in the
subsequent processing, is initialized and a zero value is
set in each entry. Further, an array data[k] having the
size corresponding to the length of LCS is initialized and
the respective entries are set so as not to point to
anything.

In Step 11, one character (j-th character) is read from
the character sequence II, and the occurrence position r of
this character in the character sequence I is specified with
reference to the occurrence table generated in Step 10.

Subsequently, in Step 12, it is determined whether an entry
data of S[r], which is in the r-th place of the array S[i],
is equal to an entry data of S[r-1], which is in the (r-1)th
place of the array S[i]. If it is determined that S[r] =
S[r-1] in Step 12, Step 13 follows in which "1" is added to
S[r] and to each S[l], where l > r, whose entry data is
equal to that of S[r-1]. On the other hand, if it is
determined that S[r] is not equal to S[r-1] in Step 12, this
routine proceeds to Step 17 of the processing flow of Fig. 4
without executing the addition processing in Step 13. In the
case where the character of the character sequence II read
in Step 11 occurs at a plurality of positions in the
character sequence I, this processing is repeated in
decreasing order of the occurrence positions r.

In this manner, the LCS detection unit 30 also executes
the processing so as to detect the length of LCS described
in the processing flow of Fig. 2 in detecting the longest
common subsequence.

After execution of the processing of Step 13, paired
data (r, j) including the occurrence position r in the
character sequence I and the occurrence position j in the
character sequence II is stored in the array data[k] in Step
14 in accordance with the length of LCS k, which is
an entry data of S[r]. In fact, the paired data (r,
j) is stored at the last of the list linked to the 
array data[k] . If the array S[i] is unchanged from 
the one in the preceding processing cycle, the above 
storing processing is not executed. 

Subsequently, this routine proceeds to the 
processing flow of Fig. 4 and, in Step 15, it is 
determined whether relationships r'< r, j'< j are 
satisfied with respect to each of the character 
positions r', j' stored in the data[k-l]. Since the 
character positions cannot be reversed in the 
subsequences, the above relationship must be satisfied 
along a subsequence. Therefore, the data (r, j) is 
linked to the data (r', j') in Step 16, only when the 
above relationship is satisfactory. In subsequent 
Step 17, it is determined whether the processing has 
been completed up to the last character of the
character sequence II. If the determination result is
in the negative in Step 17, this routine returns to 
Step 11 of the processing flow shown in Fig. 3. On 
the other hand, if it is determined that the above 
relational expressions are not satisfied in Step 15, 
this routine proceeds to Step 17 without executing the 
processing of Step 16. 

If it is determined that the processing has been 
completed up to the last character of the character 
sequence II in Step 17, this processing flow ends. 
The longest common subsequence and the occurrence 
position thereof are determined by tracing back the 
link set in Step 16, as will be described in detail 

later.

An example of the processing shown in Figs. 3 and 
4 will be described with respect to a case where a 
first amino acid sequence is expressed by the 
character sequence I "ABCBDAB" and a second amino acid
sequence is expressed by the character sequence II
"BDCABA" similar to the aforementioned example.
As shown at a left end of Fig. 6, since r = 7, j =
1, and k = 1 when S[r] is first renewed in Step 13 of
Fig. 3, data (7, 1) is stored in a data[1] by being
linked thereto in Step 14 of Fig. 3 as shown in Fig.
7. Thereafter, data (4, 1), (2, 1) are stored.

Since nothing is stored in a data[0] set, for the 
sake of convenience, the processing of Step 16 is not 
applied thereto. Since S[r] is renewed when r = 5, j 
= 2, and k = 2, data (5,2) is stored in a data[2] as 
shown in Fig. 7. In Step 15, the relationships r' < r 
and j' < j are satisfied for the data (4, 1) and (2, 
1) among the data (7, 1), (4, 1), and (2, 1) stored in 
the data[1]. Accordingly, the data (5, 2) is linked
to the data (4, 1) and (2, 1) through pointers 70, 72
shown in Fig. 7 in Step 16. By repeating the 
aforementioned processing, a linked list shown in Fig. 
7 is generated. As shown at the right side of Fig. 6, 
the data (1, 6) is not stored in the data[k] since 
S[r] is unchanged when r = 1 and j = 6. 
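
The bookkeeping of Steps 10 to 17 can be sketched as follows. This is a minimal illustration as read from the description, not the apparatus itself: the function name `build_lcs_links` and the use of Python lists and dictionaries in place of the linked list of Fig. 7 are assumptions made for the example.

```python
from collections import defaultdict


def build_lcs_links(seq1, seq2):
    """Return (Kmax, data, links) for character sequences I and II.

    Hypothetical helper: positions r and j are 1-based, as in the text.
    """
    m = len(seq1)
    # Step 10: occurrence table, positions of each character of
    # sequence I kept in decreasing order.
    occurrence = defaultdict(list)
    for pos in range(m, 0, -1):
        occurrence[seq1[pos - 1]].append(pos)
    s = [0] * (m + 1)          # S[0..m]; S[0] serves the case r = 1
    data = defaultdict(list)   # data[k]: pairs (r, j) stored when S[r] became k
    links = {}                 # (r, j) -> linked pairs (r', j') in data[k-1]

    for j, ch in enumerate(seq2, start=1):            # Step 11
        for r in occurrence.get(ch, []):
            if s[r] != s[r - 1]:                      # Step 12
                continue                              # S[r] unchanged, nothing stored
            old = s[r]
            pos = r
            while pos <= m and s[pos] == old:         # Step 13
                s[pos] += 1
                pos += 1
            k = s[r]
            data[k].append((r, j))                    # Step 14
            links[(r, j)] = [(rp, jp) for rp, jp in data[k - 1]
                             if rp < r and jp < j]    # Steps 15 and 16
    return s[m], data, links


kmax, data, links = build_lcs_links("ABCBDAB", "BDCABA")
print(kmax)            # length of LCS
print(data[1][:3])     # first pairs stored in data[1]
print(links[(5, 2)])   # pairs that (5, 2) is linked to
```

For the example sequences, the first three pairs stored in data[1] are (7, 1), (4, 1), (2, 1), the pair (5, 2) is linked to (4, 1) and (2, 1), and no pair (1, 6) is stored, which agrees with the walk-through above.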

The longest common subsequence and the occurrence 
position thereof are determined by tracing back the 
pointers of the character position information stored 
in the data[k]. If this is explained more
specifically using the example of Fig. 7, the link
"(7, 5) of the data[4] -> (6, 4) of the data[3] ->
(5, 2) of the data[2] -> (4, 1) of the data[1]" is
traced and arranged in reverse order, thereby
determining the longest common subsequence BDAB and
the occurrence positions in the character sequences I
and II. Also, the longest common subsequence BDAB and
the occurrence positions thereof in the character
sequences I and II are determined from the link "(7,
5) of the data[4] -> (6, 4) of the data[3] -> (5, 2)
of the data[2] -> (2, 1) of the data[1]". Further,
the longest common subsequence BCAB and the occurrence
positions thereof in the character sequences I and II
are determined from the link "(7, 5) of the data[4] ->
(6, 4) of the data[3] -> (3, 3) of the data[2] ->
(2, 1) of the data[1]". Moreover, the longest common
subsequence BCBA and the occurrence positions thereof
are determined from the link "(6, 6) of the data[4] ->
(4, 5) of the data[3] -> (3, 3) of the data[2] ->
(2, 1) of the data[1]".
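
The tracing of the links can be illustrated with a short recursive sketch. The dictionary `LINKS` below hard-codes only the links named in the text for the example of Fig. 7. The names `trace` and `all_lcs` are hypothetical, and the recursion is one possible way to follow the pointers, not necessarily the flow of the figures.

```python
SEQ1 = "ABCBDAB"  # character sequence I of the example

# Links named in the text: each pair (r, j) points to pairs in data[k-1].
LINKS = {
    (7, 5): [(6, 4)],
    (6, 6): [(4, 5)],
    (6, 4): [(5, 2), (3, 3)],
    (4, 5): [(3, 3)],
    (5, 2): [(4, 1), (2, 1)],
    (3, 3): [(2, 1)],
    (4, 1): [],
    (2, 1): [],
}
DATA_KMAX = [(7, 5), (6, 6)]  # pairs stored in data[Kmax], Kmax = 4


def trace(node, links, suffix=()):
    """Yield every chain of pairs ending at `node`, earliest pair first."""
    path = (node,) + suffix
    if not links[node]:
        yield path
        return
    for prev in links[node]:
        yield from trace(prev, links, path)


def all_lcs(seq1, data_kmax, links):
    """Return (subsequence, chain of pairs) for every traced link."""
    results = []
    for node in data_kmax:
        for path in trace(node, links):
            text = "".join(seq1[r - 1] for r, _ in path)
            results.append((text, path))
    return results


results = all_lcs(SEQ1, DATA_KMAX, LINKS)
for text, path in results:
    print(text, path)
```

Because each chain is assembled by prepending the earlier pair, it is already in the reversed (left to right) order when it is emitted.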

*h«> LCS is taken from a data[Kmax] . In Step 22, 

the LCS is tajv called to trace and 

^x^:- orris - r - 

.till remains in the datatKmax] T ^ ^ gtep 

the processing is I^o'tinned 
22 if any data remains. This ro 

un til the linKs of all the LCS are ^ . 

retrieval processing subroutine shown in Fx, > 
retrieval p determined 
recursive routine. In Step 30 ^ ^ ^ 

:.rr.rrrr. !. 

a «.: r r -r;rr.:v t !r.-.:t,:.. - 

,,, ml «. «•« " " «... 

.„« ,» .... s«. >»• " » 
calling the subroutine recursively. Upon completion of the
processing of Step 44, the next pointer is taken out in the
following step and this subroutine returns to Step 36,
thereby executing processing for the next branch.

By executing the above processing, for example, the
data (7, 5), (6, 4), (5, 2), and (4, 1) are taken out in
the example shown in Fig. 7, and the LCS "BDAB" and the
occurrence position thereof are output. Then, the next
pointer is taken out to obtain the data (7, 5), (6, 4),
(5, 2), and (2, 1), and the LCS "BDAB" and the occurrence
position thereof are output. Further, the data (7, 5),
(6, 4), (3, 3), and (2, 1) are obtained and the LCS "BCAB"
is output. Moreover, the data (6, 6), (4, 5), (3, 3), and
(2, 1) are obtained and the LCS "BCBA" is output. In this
way, all the LCS are output.

The processing that the respective processing units
31 to 35 of the gene information survey apparatus 1 shown in
Fig. 1 execute upon receipt of the length of LCS, the
longest common subsequence, and their occurrence positions
detected by the LCS detection unit 30 will be described next.
When the LCS detection unit 30 decides the length of
LCS between the character sequence of the amino acid
sequence input from the input device 40 (hereinafter
referred to as an input amino acid sequence) and the
character sequence of the amino acid sequence given from
the amino acid sequence data base 50 or the motif data base
60, the homology decision unit 31 calculates the ratio of
the length of LCS to the length of the input amino acid
sequence. In the case where this ratio is greater than a
predetermined reference value, the input amino acid
sequence is determined to be homologous with the amino acid
sequence given from the amino acid sequence data base
50 or the motif data base 60. In the case where this
ratio is smaller than the predetermined reference
value, the input amino acid sequence is determined not
to be homologous with the amino acid sequence given
from the data base 50 or 60.

Based on the decision result of the homology
decision unit 31, the homology search unit 32 searches
the amino acid sequence data base 50 for an amino acid
sequence being homologous with the input amino acid
sequence. In the case where the two amino acid
sequences are homologous, the ratio calculated by the
homology decision unit 31 and the longest common
subsequence determined by the LCS detection unit 30
are displayed in the display device 42 through the
display control unit 35.
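
A homology decision of this kind can be sketched as below. The sketch assumes that the ratio is the length of LCS divided by the length of the input amino acid sequence; that definition, the function names, and the reference value 0.5 are assumptions made for illustration. The LCS length is computed here with an ordinary dynamic-programming table rather than with the occurrence-table method of the text.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def homology_ratio(input_seq, db_seq):
    """Assumed ratio: LCS length over the length of the input sequence."""
    if not input_seq:
        return 0.0
    return lcs_length(input_seq, db_seq) / len(input_seq)


def is_homologous(input_seq, db_seq, reference=0.5):
    """True when the ratio is greater than the reference value."""
    return homology_ratio(input_seq, db_seq) > reference


print(lcs_length("ABCBDAB", "BDCABA"))
print(is_homologous("ABCBDAB", "BDCABA", reference=0.5))
```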

Fig. 10 shows an example of this display. The
display example displays a processing result of two
amino acid sequences: human cytochrome c and bacteria
cytochrome c. The longest common subsequences are
displayed in accordance with a display mode indicative
of the interval at which they are arranged in the two
amino acid sequences. More specifically, by adopting
a mode of displaying "GD {x 3, 3} G {x 0, 1} K {x 0,
2}", the longest common subsequences are displayed
as follows. In the human cytochrome c, "GD" is
followed by three characters that do not coincide,
followed by "G", which is immediately followed by "K".
On the other hand, in the bacteria cytochrome c, "GD"
is followed by three characters that do not coincide,
followed by "G", which is followed by one character
that does not coincide. "K" follows immediately
thereafter.

The motif search unit 33 first searches the motif
data base 60 for the motif sequence being homologous
with the input amino acid sequence based on the
decision result of the homology decision unit 31, and
then decides whether the homologous motif sequence is
a true motif sequence included in the input amino acid
sequence in accordance with the longest common
subsequences determined by the LCS detection unit 30
and the length of the character sequence between the
longest common subsequences. For instance, it is
determined whether the input amino acid sequence
includes a motif sequence called leucine zipper in
which "L" is followed by unspecified six characters,
which is followed again by "L", and a total of 5 "L"
are included together with the six unspecified
characters. In the case where the input amino acid
sequence includes the motif sequence, the motif search
unit 33 displays the input amino acid sequence and the
motif sequence in the display device 42 through the
display control unit 35. Fig. 11 shows a display
example of a rat egg cell potassium channel including
a motif called the leucine zipper.
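
The leucine zipper condition described above (five "L", each pair separated by six unspecified characters) can be checked with a regular expression. This is only an illustrative shortcut and not the LCS-based decision of the motif search unit 33. The function name and the test sequences are hypothetical.

```python
import re

# "L", then four repetitions of (six unspecified characters + "L"):
# five "L" in total.
LEUCINE_ZIPPER = re.compile(r"L(?:.{6}L){4}")


def find_leucine_zipper(sequence):
    """Return the 0-based (start, end) span of the first match, or None."""
    match = LEUCINE_ZIPPER.search(sequence)
    return match.span() if match else None


# Hypothetical sequences built only to exercise the pattern.
with_motif = "MK" + "L" + "AAAAAAL" * 4 + "GG"
without_motif = "L" + "AAAAAAL" * 3
print(find_leucine_zipper(with_motif))
print(find_leucine_zipper(without_motif))
```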

Upon receipt of the longest common subsequences
and their occurrence positions that the LCS detection
unit 30 detects, the alignment unit 34 aligns the
input amino acid sequence and the amino acid sequence
given from the amino acid sequence data base 50 and
the motif data base 60 so as to relate the longest
common subsequence in one amino acid sequence to that
in the other, and displays the aligned amino acid
sequences in the display device 42 through the display
control unit 35. Fig. 12 shows an example of this
display, which displays a processing result of two
amino acid sequences: human cytochrome c and bacteria
cytochrome c. The alignment processing is carried out
by inserting a blank corresponding to the length of
the character sequence between the positions of the
longest common subsequences.
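
One way to realize such blank insertion is sketched below. It assumes that the occurrence positions are given as 1-based pairs (r, j) in increasing order, and it pads the shorter of the two unmatched stretches between consecutive matched characters with blanks. The function name and the padding rule are assumptions made for illustration.

```python
def align_by_lcs(seq1, seq2, pairs):
    """Align two sequences so that the matched characters share a column.

    `pairs` lists 1-based positions (r, j) of one longest common
    subsequence, in increasing order.
    """
    out1, out2 = [], []
    prev_r = prev_j = 0
    for r, j in pairs:
        gap1 = seq1[prev_r:r - 1]     # unmatched characters before the match
        gap2 = seq2[prev_j:j - 1]
        width = max(len(gap1), len(gap2))
        out1.append(gap1.ljust(width))  # blanks fill the shorter stretch
        out2.append(gap2.ljust(width))
        out1.append(seq1[r - 1])        # the matched character itself
        out2.append(seq2[j - 1])
        prev_r, prev_j = r, j
    tail1, tail2 = seq1[prev_r:], seq2[prev_j:]
    width = max(len(tail1), len(tail2))
    out1.append(tail1.ljust(width))
    out2.append(tail2.ljust(width))
    return "".join(out1), "".join(out2)


line1, line2 = align_by_lcs("ABCBDAB", "BDCABA",
                            [(4, 1), (5, 2), (6, 4), (7, 5)])
print(repr(line1))
print(repr(line2))
```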

Next, a method of partially relating elements including
an atom or an atomic group in three-dimensional
structures of molecules, particularly protein
molecules, and comparing them with each other will be
described.

"^^rlnstance, it is assumed that there are 
substances expressed by a point set A and a point set B.
When the points of these sets are related to each other as
shown in Fig. 13C, the substance B is rotated and moved so
that the interval between the corresponding points is
minimized. This interval is evaluated by the r.m.s.d.
in the following equation wherein U denotes a rotation
matrix and w_k denote respective weights:

$$\text{r.m.s.d.} = \left( \frac{\sum_{k} w_k \,(U b_k - a_k)^2}{n} \right)^{1/2}$$

A technique of obtaining the rotation and movement of
the substances which minimizes the r.m.s.d. value
between these corresponding points is proposed by
Kabsh et al. as described above, and is presently
widely used.

1. Various Methods of Determining Correspondence

(1) Generation of correspondence of point
sets that are not ordered
The substances A and B are expressed,
respectively, by the point set A = {a_1, a_2, ..., a_i,
..., a_m}, 1 ≤ i ≤ m, and the point set B = {b_1, b_2,
..., b_j, ..., b_n}, 1 ≤ j ≤ n. The respective points
a_i = (x_i, y_i, z_i) and b_j = (x_j, y_j, z_j) are
expressed as a three-dimensional coordinate. All
combinations of correspondence can be generated by creating
a tree construction as shown in Fig. 14A.

Fig. 14B shows an example of
correspondence in the case where a point set A
includes three elements and a point set B includes
four elements, i.e., the correspondence between the
point set A = {a_1, a_2, a_3} and the point set B =
{b_1, b_2, b_3, b_4}. A dotted line represents generated
candidates, and a solid line represents an optimum
correspondence (a_1 and b_2, a_2 and b_3, a_3 and b_4)
among all the generated candidates.

In this figure, nil corresponds to a 
case where no corresponding point exists. By using 
the nil, an optimum correspondence can be generated 
even in the case where the number of elements of one 
set differs from that of the other. An optimum 
correspondence can be generated by applying Kabsh's 
method to thus generated combinations, and selecting a 
combination whose root mean square distance value 
(r.m.s.d. value) is smallest. 

However, using this technique it is 
generally impossible to effect a calculation since, 
for example, n- combinations are generated. 
Specifically, in the case of the point set A (m
points) and the point set B (n points), which are not
ordered, if i is assumed to be the number of nil, the
number of generated combinations is expressed as
follows:

$$\sum_{i=0}^{m} {}_{n}P_{m-i} \times {}_{m}C_{i} = \sum_{i=0}^{m} \frac{n!}{(n-m+i)!} \times \frac{m!}{i!\,(m-i)!}$$

Here, if it is assumed that n = 4, m = 3, the above
equation is expressed as follows.

$$\sum_{i=0}^{3} {}_{4}P_{3-i} \times {}_{3}C_{i} = \sum_{i=0}^{3} \frac{4!}{(4-3+i)!} \times \frac{3!}{i!\,(3-i)!} = 24 + 36 + 12 + 1 = 73$$

In other words, 73 combinations are generated, as in
the case of the point set A (3 points) and the point
set B (4 points) shown in Fig. 14B. In reality, a huge
number of combinations are generated since the number
of points (elements) is usually far greater than
these.
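
The count for nonordered point sets can be checked numerically. The short sketch below evaluates the sum term by term for given m and n. The function name is hypothetical, and `math.perm` and `math.comb` require Python 3.8 or later.

```python
from math import comb, perm


def unordered_terms(m, n):
    """Terms nP(m-i) x mC(i) for i = 0..m, where i is the number of nil."""
    return [perm(n, m - i) * comb(m, i) for i in range(m + 1)]


terms = unordered_terms(3, 4)
print(terms, sum(terms))
```

Even for modest sizes the total grows quickly, which is why the refinements described in (4) to (6) are applied while the candidates are generated.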

Accordingly, in generating 
correspondence between these sets, it is designed to 
generate an optimum combination in view of the 
geometric relationship within the respective sets, the 
threshold value condition, and the attribute of points 
described in detail in (4), (5), (6) below. 

Fig. 15 shows an example of an algorithm for
generating correspondence between the point sets A and
B including elements, namely points, that are not
ordered.

The elements a_i are taken individually
from the point set A, and combined with elements b_j
which are not yet included in ancestors or siblings in
the tree structure. Then, it is determined whether
this combination satisfies a restriction condition to
be described later. If the combination satisfies the
restriction condition, it is registered in the tree
structure and the next element is related.
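
The tree generation can be sketched as a depth-first enumeration. In this sketch each element of A is related either to an unused element of B or to nil (`None`). The restriction condition is omitted, and the `ordered` flag, which keeps the related indices of B increasing as in the ordered case of (2), is an assumption added so that both counts given in this document (73 and 35 for m = 3, n = 4) can be reproduced. Names and the 0-based indices are illustrative.

```python
def enumerate_correspondences(m, n, ordered=False):
    """List every correspondence of m elements of A to n elements of B.

    Each result is a tuple of length m whose i-th entry is the 0-based
    index of the related element of B, or None for nil.
    """
    results = []

    def extend(i, used, partial):
        if i == m:                        # all elements of A are related
            results.append(tuple(partial))
            return
        extend(i + 1, used, partial + [None])   # nil branch
        # For ordered sets only elements after the last related one qualify.
        start = max(used) + 1 if (ordered and used) else 0
        for j in range(start, n):
            if j not in used:             # not used by an ancestor node
                extend(i + 1, used | {j}, partial + [j])

    extend(0, frozenset(), [])
    return results


unordered = enumerate_correspondences(3, 4)
ordered = enumerate_correspondences(3, 4, ordered=True)
print(len(unordered), len(ordered))
```

The correspondence (a_1 and b_2, a_2 and b_3, a_3 and b_4) of Fig. 14B appears in both lists as the tuple (1, 2, 3).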

(2) Generation of ordered point sets 

The substances A and B are expressed,
respectively, by the point set A = {a_1, a_2, ..., a_i,
..., a_m}, 1 ≤ i ≤ m, and the point set B = {b_1, b_2,
..., b_j, ..., b_n}, 1 ≤ j ≤ n. The respective points
a_i = (x_i, y_i, z_i) and b_j = (x_j, y_j, z_j) are
expressed as a three-dimensional coordinate. In the
point set A, an order relationship is established:
a_1 < a_2 < ... < a_i < ... < a_m (or a_1 > a_2 > ... >
a_i > ... > a_m). Likewise, in the point set B an order
relationship is established: b_1 < b_2 < ... < b_j < ...
< b_n (or b_1 > b_2 > ... > b_j > ... > b_n).

In this case, elements of these point
sets are in principle related to each other in
accordance with the order relationship, and all
combinations can be generated by creating a tree
structure shown in Fig. 16A. Fig. 16B shows an
example case where the point set A includes three
elements and the point set B includes four elements.
In other words, Fig. 16B shows the correspondence
between the ordered point set A = {a_1, a_2, a_3} (order
relationship thereof is: a_1 < a_2 < a_3) and the ordered
point set B = {b_1, b_2, b_3, b_4} (order relationship
thereof: b_1 < b_2 < b_3 < b_4).
A dotted line represents generated 
candidates for correspondence, and a solid line 
represents an optimum correspondence (a and b„ a 2 and 
b,! a, and b.) among the generated candidates. In thrs 
figure, nil corresponds to a case where no
corresponding point exists. By using the nil, an
optimum correspondence can be generated even in the
case where the number of elements of one set to be
related differs from that of the other to be related.
An optimum correspondence can be generated by applying
Kabsh's method to thus generated combinations, and
selecting a combination whose root mean square
distance value (r.m.s.d. value) is smallest.
The number of generated combinations is
expressed as follows in the case of the ordered point
sets:

$$\sum_{i=0}^{m} {}_{n}C_{m-i} \times {}_{m}C_{i} = \sum_{i=0}^{m} \frac{n!}{(m-i)!\,(n-m+i)!} \times \frac{m!}{i!\,(m-i)!}$$

Here, if it is assumed that n = 4, m = 3, the number of
combinations is as follows.

$$\frac{4!}{3!\,1!} \times \frac{3!}{0!\,3!} + \frac{4!}{2!\,2!} \times \frac{3!}{1!\,2!} + \frac{4!}{1!\,3!} \times \frac{3!}{2!\,1!} + \frac{4!}{0!\,4!} \times \frac{3!}{3!\,0!} = 4 + 18 + 12 + 1 = 35$$

In the case of the point set A (3 points) and the
point set B (4 points) as shown in Fig. 16B, 35
combinations are generated. Since the order
relationship is applied to the respective elements
within the point sets in the case of the ordered point
sets, the number of combinations is reduced greatly
compared to (1). In relating these sets, an optimum
combination can be generated in view of the geometric
relationship within the respective sets, the threshold
value condition, and the attribute of points described
in detail in (4), (5), (6) below.

Fig. 17 shows an example of an algorithm
for relating elements of the ordered point sets A and
B.

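The elements a_i are taken individually
from the point set A, and combined with elements b_j
which are not yet included in ancestors or siblings in
the tree structure and which satisfy the order
relationship. Then, it is determined whether this
combination satisfies the restriction condition, and
if so, the next element is related.

(3) Generation of correspondence of ordered
or nonordered point sets that are
partially related to each other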
In the case of (1) or (2), there are
cases where pairs of points that are partially related
are determined in advance. In this case, while
referring to information on the elements related in
advance, the remaining elements of the respective
point sets are sequentially related similar to the
technique (1) or (2), thereby creating a tree
structure as shown in Fig. 18. In this way, all
combinations can be generated.

In Fig. 18, indicated at x is a portion
to be pruned based on the partial correspondence.
This figure shows a correspondence in the case where
an element of the point set A and an element of
the point set B are related to each other in advance.
Similar to (1), (2), in relating these sets, an
optimum combination can be generated in view of the
geometric relationship within the respective sets, the
threshold value condition, and the attribute of points
described in detail in (4), (5), (6) below.

(4) Refining of candidates based on a
geometric relationship
Since the generation of unnecessary
combinations can be prevented by generating
correspondence between elements of point sets
considering a geometric relationship, the points
can be related efficiently.

(a) Refining of candidates based on a
distance relationship
In relating the point sets, there
is a distance relationship established between s (1 ≤
s ≤ m - 1, n - 1) points close to an element a_i within
the point set A: |a_i - a_{i-1}|, ..., |a_i - a_{i-s}|,
and another distance relationship established between
s points close to an element b_j within the point set
B: |b_j - b_{j-1}|, ..., |b_j - b_{j-s}|. The
number of candidates to be related can be reduced by
selecting and relating such points that satisfy the
relationship: | |a_i - a_{i-s}| - |b_j - b_{j-s}| | ≤
Δd, wherein Δd
denotes a permissible error range.

Figs. 19A and 19B show an example
using the geometric relationship in the case where the
point b_j of the point set B corresponding to the
element a_i of the point set A is selected. Each
numerical value in these figures shows a distance.

As shown in Fig. 19A, there is
assumed to be a distance relationship established
between two (s = 2) points a_{i-1}, a_{i-2} close to the
element a_i of the point set A: |a_i - a_{i-1}| = 2.0,
|a_i - a_{i-2}| = 3.0. As shown in Fig. 19B, among the
elements b_p, b_q, b_r of the point set B is selected
such a point that a distance relationship between two
elements close to this point lies within the
permissible error range Δd = 0.5, and this point is
related. In this example, the point b_p
(|b_p - b_{j-1}| = 2.2, |b_p - b_{j-2}| = 3.3) is found
to satisfy the distance relationship as a result of
comparing the distance between the points as a
geometric relationship, and the point b_p is selected
as a candidate for b_j.
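
The distance test can be sketched as a simple filter. The sketch assumes that the s neighbouring points have already been chosen and are supplied in the same order for both sets. The function names are hypothetical, and the distances for b_q are invented for illustration, since the text gives values only for b_p.

```python
from math import dist


def neighbour_distances(point, neighbours):
    """Distances from `point` to each of its chosen neighbouring points."""
    return [dist(point, q) for q in neighbours]


def distance_compatible(a_dists, b_dists, tol):
    """True when every paired distance differs by at most `tol`."""
    return all(abs(da - db) <= tol for da, db in zip(a_dists, b_dists))


# Values from the example: |a_i - a_{i-1}| = 2.0, |a_i - a_{i-2}| = 3.0.
a_dists = [2.0, 3.0]
b_p = [2.2, 3.3]   # distances given in the text for b_p
b_q = [2.9, 3.1]   # hypothetical distances for a rejected point

print(distance_compatible(a_dists, b_p, tol=0.5))
print(distance_compatible(a_dists, b_q, tol=0.5))
print(neighbour_distances((0, 0, 0), [(2, 0, 0), (0, 3, 0)]))
```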

(b) Refining of candidates based on an
angle relationship
In the case where the three-
dimensional structures are similar to each other, it
can be considered that angles defined by the
respective points constituting the three-dimensional
structures are also similar. In a three-dimensional
structure, there exist an angle θ defined by three
points and an angle φ defined between planes formed by
three among four points. Hereafter, a method of
reducing the number of points to be related will be
described, taking the angle θ defined by the three
points as an example.

In relating the sets, the number of
candidates for a point to be related is reduced by
selecting and relating such points from the point sets
A and B such that an angle defined between s (2 ≤ s ≤
m - 1, n - 1) elements close to element b_j of the point
set B relative to an angle defined between s points
close to the element a_i of the point set A lies within
a permissible error range Δθ.

Fig. 20B shows a case where,
considering angles defined by respective elements as a
geometric relationship established between the
elements of the point set A, the points of the point
set B are related based on this consideration.

In the case where an angle defined
by the element a_i of the point set A and two (s = 2)
points a_{i-1}, a_{i-2} close to the element a_i is θ_a,
and angles defined by the elements b_p, b_q, b_r and two
elements b_{j-1}, b_{j-2} close to these elements b_p,
b_q, b_r are θ_p, θ_q, θ_r, points such that an angle
difference lies within the permissible error range Δθ
are selected and related. In this figure, since only
the point b_p satisfies the relationship: |θ_a - θ_p| ≤
Δθ, the point b_p is selected as a candidate for b_j.
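
The angle test can be sketched in the same way. The sketch assumes that the angle is measured at the element itself, between the directions to its two neighbouring points, and that angles are compared in degrees. These choices, the function names, and the coordinates are assumptions made for illustration.

```python
from math import acos, degrees, sqrt


def angle_deg(vertex, p, q):
    """Angle in degrees at `vertex` between the directions to p and q."""
    u = [a - b for a, b in zip(p, vertex)]
    v = [a - b for a, b in zip(q, vertex)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return degrees(acos(cosine))


def angle_compatible(theta_a, theta_b, tol):
    """True when the two angles differ by at most `tol` degrees."""
    return abs(theta_a - theta_b) <= tol


# Hypothetical coordinates: a right angle at a_i, a wider angle at b_q.
theta_a = angle_deg((0, 0, 0), (1, 0, 0), (0, 1, 0))
theta_q = angle_deg((0, 0, 0), (1, 0, 0), (-1, 1, 0))
print(theta_a, theta_q)
print(angle_compatible(theta_a, theta_q, tol=10.0))
```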

(c) Refining of candidates based on
distances and angles from a center
of gravity

If the three-dimensional structures 
are similar to each other, it can be considered that 
distances and angles from a center of gravity are 
similar. Accordingly, the number of candidates for a 
point to be related can be reduced by calculating the 
center of gravity from the selected points, and 
comparing the distances and angles using a technique 
similar to (a) and (b) . 

(5) Refining of candidates based on a
threshold value condition
The point sets can be more efficiently
related by setting a specified threshold value in the
aforementioned methods (1) to (4), and pruning a
retrieval path if an attribute value of a candidate is
greater than this threshold value. As this threshold
value, for example, restriction in a nil number (the
number of nil) and restriction in an r.m.s.d. value can
be used.

(a) Restriction in a nil number
When a total number of nil becomes
too large among the generated combinations,
meaningless candidates for combinations are generated
as a result. Accordingly, in relating the elements of
the point sets A and B, if the total number of nil
becomes in excess of a given threshold value, the
generation of the unnecessary candidates can be
prevented by excluding these from the candidates,
thereby relating the elements more efficiently.

Fig. 21 shows an example of pruning
in a case where a total number of nil is restricted to
0 in relating a point set A = {a_1, a_2, a_3} to a point
set B = {b_1, b_2, b_3, b_4}. In this figure, a portion
designated at * in a tree structure is a portion to be
pruned.
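
The effect of the nil restriction can be sketched by adding a nil counter to a depth-first enumeration. The sketch below treats the point sets as nonordered, which is an assumption, since the text does not state which case Fig. 21 depicts. The function name and the 0-based indices are illustrative.

```python
def enumerate_with_nil_limit(m, n, max_nil):
    """Enumerate correspondences, pruning branches with too many nil."""
    results = []

    def extend(i, used, partial, nils):
        if i == m:
            results.append(tuple(partial))
            return
        if nils < max_nil:                # the nil branch is pruned otherwise
            extend(i + 1, used, partial + [None], nils + 1)
        for j in range(n):
            if j not in used:
                extend(i + 1, used | {j}, partial + [j], nils)

    extend(0, frozenset(), [], 0)
    return results


for limit in (0, 1, 3):
    print(limit, len(enumerate_with_nil_limit(3, 4, limit)))
```

With the limit set to 0, only the correspondences that relate every element of A remain.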

(b) Restriction in an r.m.s.d. value
In the case where an r.m.s.d. value
of all the points related thus far becomes exceedingly
bad by relating an element a_i of a point set A to an
element b_j of a point set B, it is preferable to
exclude this point from consideration of the
candidates. In view of this, the r.m.s.d. value of
all the points when the element a_i is related to the
element b_j is calculated, and this point is selected as
a candidate if the calculated r.m.s.d. value is not
greater than a given threshold value. On the
contrary, this point is excluded from the candidates
if the r.m.s.d. value is in excess of the given
threshold value. In this way, the candidates for a
point to be related can be generated more efficiently.

(6) Refining of candidates based on an
attribute of a point
The number of candidates for a point to
be related can be reduced by using an attribute of the
point in relating an element a_i of a point set A to an
element b_j of a point set B.

s point, for example, ^J^^ilic 

atomic croup, and a molecule, * ^ positive 

^nr- * — - by checkin9 whether 

» «"~ ."^J^, in th e case o £ retina 

elements ,ccn.tl^ P^^^"^^,,., by 

candidates for a point to due (corre sponding 

using the type of an ^ ^ o£ t L point, 

to an atomic group, as an attrxbu ^ ^ 

yarding the ^^^^cL sue. as fundamental 
like, please refer to r DQhjln 
to Biochemistry," PP- 21 
— • further, the candidates 

be related can be reduced n ^ ^ 

specific element. For example, th restriotion 
Retrieved can be ^^J^ ^certain P»i»* °< » 
that the nil is not ,nse rted to ^ 
i designating an attribute of point 

, . Adaptation Examples . les whe re 

Described below are adapta ^° hree . diB ensional 
th e theme consists of a protein ..J^ ^ i3 nQ 
structure of a ^^J^ the sublet 
l0 particular limitation ^ and the 

Lsically has ^--^^^n Ise having general 
invention can be adapte t . • ^ fflethod . 

m olecules are superposed one upon another. 
arific is discriminated so as to analyze
common portion or specific is ax 

OTer T/" ^Ti-t- -station o* a device 
tha t Splays U- Secular structures in an ;r; a ppe : ±> 
manner according to the present - * iBCered data 

constituted by a d a ta base 80 in which r g ^ ^ 

related to the ^ » -;:l red data and an 

data input unit 82 that superposit ion calculation unit 

— rsrs :u"ii-. r - 
.„,„„. ™ - -»•■<• - ■ 

application r.m.s.d. v a iue thre e-dimensional 
graphic display unit 86 that dl * 1 '^ 0 ^ the calculated 
structures in an overlapped nanner b a sed 

results . 

(a) Data base 80
The data base 80 stores data related to
the substances, such as three-dimensional coordinates of
atoms constituting the substances, etc.

(b) Data input unit 82
The data input unit 82 reads
the data (three-dimensional coordinates) of substances that
are to be superposed based on an input command of a user,
and sends the data to the superposition calculation unit 84.

(c) Superposition calculation unit 84
The superposition calculation unit 84 relates the
elements of the substances in order to superpose the
structures (three-dimensional coordinates)
of substances according to the method of superposition
discussed in Section 1, entitled "Various Methods of
Determining Correspondence", on page 28 of this application
in a manner such that optimum r.m.s.d. values are obtained,
and sends the results to the graphic display unit 86. In
determining the correspondence, there is provided a function
that finds correspondence between spatially similar portions
based on the order of amino acid sequence that constitutes a
protein, and a function that finds correspondence between
spatially similar portions irrespective of the order of amino
acid sequence. In retrieving the spatially similar portions
based on the order of amino acid sequence, amino acids
constituting the protein can be grasped as an ordered set
whose elements are ordered according to the numbers of amino
acid sequence, and therefore similar portions can be
calculated based on the methods discussed in Section 1,
subsections (2), (3), (4), (5), and (6) on pages 30, 32, 33,
35, and 36, respectively, of this application. By grasping
the amino acids simply as a nonordered set, furthermore, it
is possible to calculate spatially similar portions
irrespective of the order of amino acid sequence relying
upon the systems mentioned in Section 1, subsections (1),
(3), (4), (5) and (6) on pages 30, 32, 33, 35, and 36,
respectively, of this application.
(d) Graphic display unit 86
The graphic display unit 86 displays the
three-dimensional structures of substances in a superposed
manner based on the results calculated by the superposition
calculation unit 84. Upon looking at the displayed result
while manually rotating it, it is understood what portions
are superposed and how they are superposed in a 3D graphic.

Fig. 23A shows an amino acid sequence of
calmodulin, which is a protein, and Fig. 23B shows an amino
acid sequence of troponin C. Figs. 23A and 23B show in
excerpts the amino acid sequences.
The amino acid sequence shown in Fig. 23A lacks the amino
acids that correspond to amino acid
sequence Nos. 1-4 and 148 included in the ordinary
amino acid sequence and, hence, the numbers are
shifted. Hereinafter, these diagramed amino acid
sequence numbers will be used. As shown in Fig. 24A,
it is known from results of biochemical experiments
that calmodulin can bind four Ca2+ as indicated by
black rounds. Also, it is known that troponin C can
bind two Ca2+ as indicated by black rounds in Fig. 24B.
It is known that calmodulin has four places (sites)
to bind Ca2+ in its amino acid sequence and, among these,
the amino acids of sequence numbers 81-108 and 117-143
form skeletons similar to those of two sites to bind
Ca2+ in troponin C. A protein is constituted by amino
acids and it is known that its skeleton can be
represented by the coordinates of atoms (Cα) that
constitute the amino acids. Fig. 25 shows the results
obtained when a spatially similar portion (a single
site) is searched for based on the order of amino acid
sequence using the Ca2+ binding site 81-108 of
calmodulin as a probe. Fig. 25 indicates that the
amino acid sequence numbers 96-123 in troponin C
correspond to the Ca2+ binding site 81-108 in
calmodulin. These results are in agreement with the
biochemically experimented results. Fig. 26 shows the
results obtained when spatially similar portions (a
plurality of sites) are searched for based on the
order of amino acid sequence using the Ca2+ binding
sites 81-108 and 117-143 in calmodulin as probes.
Fig. 26 indicates that the amino acid sequence
numbers 96-123 and 132-158 in troponin C correspond to
the Ca2+ binding sites 81-108 and 117-143 in
calmodulin. These results are in agreement with the
biochemically experimented results, too. By using the
apparatus of the present invention as described above,
correspondence among the constituent elements of
substances can be calculated in a manner such that the
r.m.s.d. values are minimized in the three-dimensional
structures of the substances. By displaying the
corresponding portions in a superposed manner,
therefore, it becomes possible to display the
substances in a superposed manner in an optimum
condition.

(2) Three-dimensional structure retrieval device 
and function data base generating device 
It is essential to clarify a correlation
between the function and the structure of a substance
in order to develop a substance having a new function
such as a new medicine or to improve the function of a
substance that already exists. To promote the
aforesaid work, it becomes necessary to make
references to many substances having similar
three-dimensional structures. This necessitates a
three-dimensional structure retrieving device that is
capable of easily taking out the substances having
similar three-dimensional structures from the data
base. Moreover, a device of this kind makes it
possible to prepare a function data base in which are
collected three-dimensional structures that are
related to the functions. The function data base will
be described later in (3). Fig. 27 is a diagram
illustrating the system constitution of a
three-dimensional structure retrieving device that is
constituted by a data base 80 that stores
three-dimensional structures of substances, a data input
unit 82 that reads the data registered to the data
base 80 and an input command of a user, a similarity
calculation unit 88 that retrieves structures which are
similar to the three-dimensional structures
(three-dimensional coordinates) of substances read by the
data input unit 82 and which minimize the r.m.s.d. value,
based on the method of superposition mentioned in the
earlier chapter, and a retrieved result display unit 90
showing the retrieved results. Fig. 28 is a diagram
illustrating the system constitution of a device that
generates a function data base.

(a) Data base 80.

The data base 80 stores the data related to
three-dimensional structures of substances, i.e., stores the
names of substances, the three-dimensional coordinates of
atoms constituting the substances, etc.

(b) Data input unit 82.

The data input unit 82 reads the data of
three-dimensional structures that serve as keys for
retrieval and the data of three-dimensional structures
registered to the data base 80 that will be referred to
during the retrieval based on the input command from the
user, and sends the data to the similarity calculation
unit 88.

(c) Similarity calculation unit 88.

The similarity calculation unit 88 calculates
optimum superposition of three-dimensional structures. At
this moment, there are provided a function for retrieving
spatially similar portions based on the order of amino acid
sequence that constitutes a protein, and a function for
retrieving spatially similar portions irrespective of the
order of amino acid sequence. In retrieving the spatially
similar portions based on the order of amino acid sequence,
amino acids constituting the protein can be grasped as an
ordered set whose elements are ordered according to the
numbers of amino acid sequence, and therefore similar
portions can be calculated based on the methods described in
section 1, subsections (2), (3), (4), (5) and (6) on pages
30, 32, 33, 35, and 36, respectively, of this application.
By grasping the amino acid sequence simply as a nonordered
set, furthermore, it is possible to calculate spatially
similar portions irrespective of the order of amino acid
sequence relying upon the systems mentioned in section 1,
subsections (3), (4), (5) and (6) on pages 30, 32, 33, 35,
and 36, respectively, of this application.

(d) Retrieved result display unit 90.

The retrieved result display unit 90
expresses similar portions as amino acid sequence
names and amino acid numbers based on the results of
the similarity calculation unit 88, and displays
r.m.s.d. values as a scale of similarity.

Fig. 29 shows the results obtained when
similar three-dimensional structures are retrieved
from the PDB using, as probes, coordinates of Cα
corresponding to the amino acid residue Nos. 7 to 14
in elongation factor of protein, which is a binding
site for phosphoric acid of GTP (guanosine
triphosphate). Retrieval is carried out over 744
three-dimensional structures of protein among 824 data
registered to the PDB. Fig. 29 shows amino acid
residue numbers of a target protein that is retrieved,
an amino acid residue sequence, an amino acid residue
sequence of a probe, and r.m.s.d. values between
target and probe three-dimensional structures. As a
result, eight three-dimensional structures are
retrieved (including the probe itself). If classified
depending upon the kinds of proteins, there are
retrieved three adenylate kinases, two elongation
factors (one of which is the probe itself) and three
ras proteins, all of which are sites where
phosphoric acid of ATP or GTP is bound. Thus, the
function of sites binding phosphoric acid of ATP or
GTP has a very intimate relationship to their
three-dimensional structures, and their structures are
very specific because they never incidentally coincide
with other structures that are not phosphoric acid
binding sites. In Fig. 30, the retrieved results are
partly shown by their three-dimensional structures.

By using this device as described above,
it is possible to retrieve similar structures from the
data base in which are stored three-dimensional
structures of substances by designating the
three-dimensional structure of a substance that serves
as a probe.

(3) Function predicting device.

As will be implied from the results shown in Fig.
29, it is considered that a protein has a three-dimensional
structure that specifically develops its function.
Therefore, if a data base (hereinafter referred to as
function data base) of three-dimensional structures specific
to the function is provided for each of the functions, then
it becomes possible to predict what function is exhibited by
a substance and by which portion (hereinafter referred to as
function site) of the three-dimensional structure the
function is controlled, by examining whether the structures
registered to the function data bases exist within the
three-dimensional structure of the substance that is newly
determined by X-ray crystal analysis or NMR. Fig. 31
illustrates the function predicting device, which is
constituted by a data input unit 82 that receives as inputs
the three-dimensional structures of substances, a function
data base 92 to which are registered the three-dimensional
structures that are related to functions, a function
prediction unit 94 that performs optimum superposition of
the three-dimensional structure read from the function data
base 92 and the three-dimensional structure of a substance
that is an input, based on the method of retrieving the
three-dimensional structure described in section 1 on page
28, in order to determine whether the three-dimensional
structure includes a structure related to the function and
to specify the function sites, and a predicted result display
unit 96 that displays the predicted results.

(a) Data input unit 82. 

The data input unit 82 reads the data of 
three-dimensional structures constituting substances and 
sends them to the function prediction unit. 

(b) Function data base 92.

The function data base 92 stores the
functions of substances and data related to
three-dimensional structures specific to the functions. The
data base stores the names of functions, and
three-dimensional coordinates of atoms constituting
three-dimensional structures specific to the functions, etc.
The function data base 92 is formed by a function data base
generating device (Fig. 28) that is constituted similarly to
the three-dimensional structure retrieving device described
in (2) above.

(c) Function prediction unit 94. 

The function prediction unit 94 calculates
the optimum superposition of three-dimensional structures
registered to the function data base 92 and
three-dimensional structures that are input. At this moment,
there are provided a function for retrieving spatially
similar portions based on the order of amino acid sequence
that constitutes a protein, and a function for retrieving
spatially similar portions irrespective of the order of
amino acid sequence. In retrieving the spatially similar
portions based on the order of amino acid sequence, amino
acids constituting the protein can be grasped as an ordered
set whose elements are ordered according to the numbers of
amino acid sequence, and therefore similar portions can be
calculated based on the methods described in section 1,
subsections (2), (3), (4), (5) and (6) on pages 30, 32, 33,
35 and 36, respectively, of this application. By grasping
the amino acid sequence simply as a nonordered set,
furthermore, it is possible to calculate spatially similar
portions irrespective of the order of amino acid sequence
relying upon the systems mentioned in section 1, subsections
(3), (4), (5) and (6) on pages 30, 32, 33, 35, and 36,
respectively, of this application.

(d) Predicted result display unit 96.

The predicted result display unit 96
expresses the names of functions, names of amino acid
sequences at function sites and amino acid residue
numbers registered to the function data base relying
on the results of the function prediction unit 94, and
displays r.m.s.d. values as a scale of similarity.
Analysis of Three-Dimensional Structures of
Molecules II

In the aforementioned method of imparting
correspondence, similar structures were successfully [...]
relations such as distances among the elements in a [...],
r.m.s.d. values and the number of nils, as well as
attributes of constituent elements (kinds of amino acids in
the case of a protein), and by finding [...] calculated
under certain shape conditions of the three-dimensional
[...] before executing the processings [...]. The
three-dimensional structures (partial structures) of
molecules are divided into those having linear structures
and those having non-linear structures. Among them, those
having linear structures are processed at a higher speed
using a method described below.

Referring to Fig. 32A, the structure in which
two points at both ends of a three-dimensional
structure are most distant from each other is called a
linear structure. Referring to Fig. 32B, on the other
hand, the structure in which two points at both ends
are not most distant from each other is called a
non-linear structure.

In accomplishing correspondence among the
elements between point sets A and B that form
three-dimensional structures, according to this embodiment,
after the point set B is divided depending upon the
spatial size or the number of constituent elements of
the point set A in order to find subsets of points
that are candidates for the corresponding points, the
optimum correspondence is effectively searched for
with respect to each of the subsets. Described below
is a method of finding the subsets.
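
The linear/non-linear distinction drawn with reference to Figs. 32A and 32B can be checked mechanically. The following is a minimal sketch, not the procedure of the specification: the function name `is_linear` and the tolerance `tol` are assumptions introduced here, and the test simply compares the end-to-end distance with the largest distance between any two points.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def is_linear(points: Sequence[Point], tol: float = 1e-9) -> bool:
    """Return True when the two end points are the most distant pair.

    `tol` (an assumption, not in the source) absorbs rounding error when
    the end-to-end distance ties with another pair.
    """
    if len(points) < 2:
        return True
    end_to_end = math.dist(points[0], points[-1])
    # Largest distance between any two points of the structure.
    longest = max(math.dist(p, q) for p, q in combinations(points, 2))
    return end_to_end >= longest - tol
```

A straight run of points is classified as linear, whereas a hairpin whose two ends lie close together is classified as non-linear.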

(1) Division of an ordered point set B according
to the number of constituent elements of a
point set A.

Fig. 33 is a diagram explaining how to divide 
a point set B according to the number of constituent 
elements of a point set A. 

The size of search space is decided according 
to the number m of elements of the point set A, and 
the point set B is divided according to the size in 
order to reduce the space to be searched, thereby 
shortening the time for calculation. In an example of 
Fig. 33, a size 10, which is twice as great as the 
number 5 of elements of the point set A, is set to be 
the size of a space to be searched, in order to effect 
the processing. 

Fig. 34 shows a division algorithm for the
point set B.

Ordered point sets are given as A = [a_1, ..., a_m],
B = [b_1, ..., b_i, ..., b_j, ..., b_n], and the
following processing is effected for the subset B' of
the point set B.

Process 1: Find the number m of elements
           of the point set A.
Process 2: Set the size (f(m)) of B' in
           compliance with a function
           f(x) that defines the size of
           the point set B'.
Process 3: Divide the point set B to
           obtain the following subset
           B'.
           (a) j = i + f(m) - 1
           (b) Point set B' = [b_i, b_i+1,
               ..., b_j-1, b_j]
Process 4: The points a_1 and b_i are
           related to each other and then
           the remaining elements of the
           point sets A, B' are related
           to each other according to the
           method explained with
           reference to Figs. 17 to 21,
           in order to find
           correspondence that meets a
           predetermined limiting
           condition.
Process 5: When b_j is a final element of
           the point set B, the program
           is finished.
           When b_j is not the final
           element of the point set B,
           obtain i = i + 1 and return to
           process 3.
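
The windowing performed by Processes 2, 3 and 5 can be sketched as a generator. This is only an illustration under stated assumptions: the name `divide_by_count` is hypothetical, indices are 0-based rather than 1-based, f defaults to f(m) = 2m as in the example of Fig. 33, and the window is clamped when f(m) exceeds the remaining length of B (a case the text does not address). Process 4, the matching of A against each window with a_1 fixed to the first element of the window, is left to the caller.

```python
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def divide_by_count(
    b: Sequence[T],
    m: int,
    f: Callable[[int], int] = lambda m: 2 * m,
) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (i, B') for each search window B' of the ordered set B."""
    if not b:
        return
    size = f(m)                                # Process 2: size of B'
    i = 0
    while True:
        j = min(i + size - 1, len(b) - 1)      # Process 3(a), clamped (assumption)
        yield i, b[i:j + 1]                    # Process 3(b): B' = [b_i, ..., b_j]
        if j == len(b) - 1:                    # Process 5: b_j is the final element
            return
        i += 1                                 # Process 5: i = i + 1
```

With m = 5 and a point set B of twelve elements, this yields three windows of ten elements each, starting at the first, second and third elements of B.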

(2) Division of an ordered point set B according 
to the spatial size of the point set A. 

As shown in Fig. 35A, a distance d is found
across the two points at both ends of the point set A,
and the point set B is divided by the distance d as
shown in Fig. 35B in order to reduce the search space,
thereby shortening the time for calculation.
According to this method, however, since the
correspondence of a head element of the set is not
fixed as mentioned with reference to the process 4 of
(1), there exists a probability that the same solution
may be calculated many times. Therefore, prior to
advancing to the next search space, the next search
space is set by taking into consideration the position
of a solution obtained in the previous search space,
so that the search spaces will not be overlapped and
the same solution will not be calculated many times.

Fig. 36 is a diagram showing a division
algorithm for the ordered point set B depending upon
the spatial size of the point set A.

The ordered point sets are given as A = [a_1, ...,
a_m], B = [b_1, ..., b_i, ..., b_j, ..., b_n], and the
following process is effected for the subset B' of the
point set B.

Process 1: Distances among points of the point
           sets A and B are calculated to
           prepare a distance table (not
           shown).
Process 2: A distance between a first point
           and a final point (a_1, a_m) in the
           point set A is found from the
           distance table, and is denoted as
           d.
Process 3: Divide the point set B.
           (a) Find from the distance table the
               one having a maximum j from among
               b_j that have a distance of d ± α
               from b_i (i = 1 in initial state)
               and that satisfy m ≤ j - i ≤ 2m.
           (b) Obtain a point set B' = [b_i,
               b_i+1, ..., b_j-1, b_j].
Process 4: Accomplish correspondence among the
           elements of point sets A, B'
           according to the method explained
           with reference to Figs. 17 to 21,
           in order to find correspondence
           that meets a predetermined limiting
           condition.
Process 5: When b_j is a final element of the
           point set B, the program is
           finished.
Process 6: When b_j is not the final element of
           the point set B:
           i) obtain i = k + 1 and return to the
              process 3 when a solution that
              satisfies the predetermined
              limiting condition is obtained
              between the point sets A and B',
              where a point corresponding to
              a_1 is b_k; or
           ii) obtain i = i + 1 and return to the
              process 3 when a solution is not
              obtained between the point sets A
              and B'.
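
The control flow of Processes 2 to 6 can be sketched as follows. This is a hedged illustration rather than the algorithm of Fig. 36 itself: the names `find_window_end`, `divide_by_size` and the callback `match` are hypothetical, indices are 0-based, distances are computed on demand instead of being read from a distance table, and stepping to i + 1 when no b_j qualifies is an assumption, since the text does not cover that case. The callback stands in for the matching method of Figs. 17 to 21 and returns the offset within B' of the point related to a_1, or None when no solution is found.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Matcher = Callable[[Sequence[Point], Sequence[Point]], Optional[int]]


def find_window_end(b: Sequence[Point], i: int, m: int,
                    d: float, alpha: float) -> Optional[int]:
    """Process 3(a): largest j with m <= j - i <= 2m and |dist(b_i, b_j) - d| <= alpha."""
    hi = min(i + 2 * m, len(b) - 1)
    for j in range(hi, i + m - 1, -1):         # scan from the largest candidate j down
        if abs(math.dist(b[i], b[j]) - d) <= alpha:
            return j
    return None


def divide_by_size(a: Sequence[Point], b: Sequence[Point],
                   alpha: float, match: Matcher) -> List[Tuple[int, int]]:
    """Return (k, j) pairs: b_k is related to a_1, and b_j closes the window."""
    m = len(a)
    d = math.dist(a[0], a[-1])                 # Process 2: end-to-end distance of A
    found: List[Tuple[int, int]] = []
    i = 0
    while i < len(b):
        j = find_window_end(b, i, m, d, alpha)
        if j is None:                          # no window starts at b_i (assumption)
            i += 1
            continue
        offset = match(a, b[i:j + 1])          # Process 4 on B' = [b_i, ..., b_j]
        if offset is not None:
            found.append((i + offset, j))
        if j == len(b) - 1:                    # Process 5: b_j is the final element
            break
        # Process 6: jump past the solution (i = k + 1), or step by one.
        i = i + offset + 1 if offset is not None else i + 1
    return found
```

Jumping to k + 1 after a successful match is what keeps successive search spaces from producing the same solution again.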

(3) Another method of dividing the ordered point
set B according to the spatial size of the
point set A.

As shown by an algorithm of Fig. 37, it is 
possible to divide the ordered point set B depending 
on the spatial size of the point set A. Even in this 
case, a distance is found across two points at both 
ends of the point set A, and the point set B is 
divided by this distance to reduce the search space 
and to shorten the time for calculation. Moreover, at 
the time of advancing to the next search space, the 
next search space is set by taking into consideration 
the number of elements of the point set A that serve 
as search keys, so that the search spaces will not be 
overlapped and the same solution will not be 
calculated many times. 

The ordered point sets are given as A = [a_1,
..., a_m], B = [b_1, ..., b_i, ..., b_j, ..., b_n], and the
following process is effected for the subset B' of the
point set B.

Process 1: Distances among points of the
           point sets A and B are
           calculated to prepare a
           distance table (not shown).
Process 2: A distance between a first
           point and a final point (a_1,
           a_m) in the point set A is
           found from the distance table,
           and is denoted as d.
Process 3: Divide the point set B.
           (a) Find from the distance table
               the one having a maximum j from
               among b_j that have a distance
               of d ± α from b_i (i = 1 in
               initial state) and that satisfy
               m ≤ j - i ≤ 2m.
           (b) Obtain a point set B' = [b_i,
               b_i+1, ..., b_j-1, b_j].
Process 4: Accomplish correspondence
           among the elements of point
           sets A, B' according to the
           method explained with
           reference to Figs. 17 to 21,
           in order to find
           correspondence that meets a
           predetermined limiting
           condition.
Process 5: When b_j is a final element of
           the point set B, the program
           is finished. When b_j is not
           the final element of the point
           set B, obtain i = j - m + 1
           and return to the process 3.
In determining the correspondence among the
points that form three-dimensional structures, the
points are related to each other after the search
space of three-dimensional structures is divided.
Therefore, the points can be related to one another
within short periods of time. These methods can
similarly be adapted to the processing devices that
are described with reference to Figs. 22, 27, 28 and
31.

Fig. 38A shows the amino acid sequence of a
protein trypsin, and Fig. 38B shows the amino acid
sequence of elastase. Figs. 38A and 38B show excerpts
of amino acid sequences registered to the PDB. The
amino acid sequence numbers shown in Figs. 38A and 38B
are those that are simply given to the amino acids
described in the PDB starting from 2 and are different
from the traditional amino acid numbers. In the
following description, the amino acid numbers that are
diagramed will be used.

The trypsin and elastase that are shown are
some kinds of proteolytic enzymes called
protease, in which histidine, serine and aspartic
acid are indispensable at the active sites. Though
these enzymes have quite different substrate
specificity, they are considered to be a series of
enzymes from the point of view of [...] and
are similar to each other with respect to structure
and catalytic mechanisms. [...]
histidine active sites of elastase with the histidine
active sites [...] of trypsin as probes. It will be
understood that 41-46 of elastase correspond to the
[...] with serine active sites (175-179) of trypsin as
probes, from which it will be understood that [...]
of elastase correspond to the active sites 175-179 of
trypsin. These results are in agreement with the
results obtained through [...].

[...] of Three-Dimensional Structures [...]

Three-dimensional structures of proteins
contain common basic structures such as α-helix and
β-strand which are called secondary structures. [...]
methods have heretofore been [...] automatic retrieval
based upon the [...] secondary structures without
using [...]. According to these methods, partial
structures along the amino acid sequence are denoted by
symbols of secondary structures and are compared by way
of symbols, but it was not possible to compare
similarities of spatial position relationships of the
elements that constitute partial structures or to
compare similarities of spatial position relationships
of partial structures.

Therefore, described below are a method in 
which a set of elements constituting a molecule is 
divided into subsets based on the secondary 
structures, and the subsets are related to each other 
based on the similarities of spatial position 
relationships of elements that belong to the subsets, 
a method of evaluating similarities of spatial 
position relationships of a plurality of subsets that 
are related to one another, and a method of analysis 
by utilizing such methods. 

(1) Division of a point set into subsets. 
The structure A and the structure B are,
respectively, constituted by a point set A = (a_1,
a_2, ..., a_i, ..., a_m), where 1 ≤ i ≤ m, and a point
set B = (b_1, b_2, ..., b_j, ..., b_n), where 1 ≤ j ≤ n,
and each point is expressed by a three-dimensional
coordinate consisting of a_i = (x_i, y_i, z_i) and
b_j = (x_j, y_j, z_j).

In order to facilitate determination of the
correspondence among the points, the structure is
divided into partial structures that are structurally
meaningful, and a point set is divided into subsets.
Examples of the partial structures which are
structurally meaningful include functional groups and
partial structures having certain functions in the
case of chemical substances, and secondary structures
such as helixes, sheet structures and partial
structures developing certain functions in the case of
proteins. The partial structures of a structure are
found by using the known data or by the analysis of
three-dimensional coordinates. The point set A
divided into subsets is denoted as A = [(a_1, a_2, ...,
a_k), (a_k+1, a_k+2, ..., a_l), ..., (a_l+1, a_l+2, ...,
a_m)], where 1 ≤ k < l < m. Here, if SA1 = (a_1, a_2,
..., a_k), SA2 = (a_k+1, a_k+2, ..., a_l), ..., SAp =
(a_l+1, a_l+2, ..., a_m), then the sets SA's are subsets
which constitute the point set A, and the set A is
expressed by SA's as A = (SA1, SA2, ..., SAp).
Similarly, the point set B is divided into SB's which
are subsets of B, and is expressed as B = (SB1, SB2,
..., SBq).

(2) Determination of correspondence among the
subsets.

Considered below is the determination of
correspondence among elements of the structure A =
(SA1, SA2, ..., SAp) and the structure B = (SB1, SB2,
..., SBq), i.e., to determine the correspondence among
subsets. In this case, possible correspondence can be
described by a tree structure created by successively
giving correspondence to the elements constituting the
sets. A node of the root of the tree is a starting
point. A leaf node represents a result of possible
setting of correspondence, and an intermediate node
represents a partial result. Nil is used when there
is no corresponding element.

Fig. 40 is a diagram illustrating the
possible correspondence of subsets. If a status tree
that corresponds to all possible combinations is
created, the number of nodes becomes significantly
high. Therefore, the branches must be pruned.
Namely, when the nodes are added by giving
correspondence between two subsets, the matching is
effected between the subsets, and the nodes are added
provided the result satisfies the limiting condition.
The limiting condition will be described later in (4).
The matching of the subsets is carried out in
compliance with the method described in the Analysis
of Three-Dimensional Structures of Molecules I.



(3) Determination of correspondence among subsets
wherein partial correspondence is
predetermined and/or that are ordered.

When partial correspondence between subsets
is predetermined and/or when subsets are ordered in
the above case (2), branches of the tree structure
formed in (2) are pruned based thereon.

(4) Refining the candidates by the similarity
among the subsets.

In the above methods (2) and (3), the
branches are pruned based on the similarity between
the two subsets that are candidates in order to
determine the correspondence efficiently. The
attributes possessed by the candidates and the
structural similarity between the two subsets are
taken into consideration. The attributes of subsets
may be the kinds of functional groups and kinds of
functions in the case of chemical substances, and the
constituent elements in the secondary structure and
the kinds of functions in the case of proteins. The
structural similarity of the two subsets is judged by
the three-dimensional structure matching method which
accomplishes the correspondence among the elements of
the two ordered point sets described in the "Analysis
of Three-Dimensional Structures of Molecules I". The
r.m.s.d. among the points is calculated when an
optimum matching is effected based on this method.

The candidates can be refined by generating
nodes of correspondence only when the two subsets that
are the candidates have the same attribute and their
r.m.s.d. values are smaller than a threshold value.
Fig. 41 shows an algorithm for determining
correspondence of subsets of the sets A and B where
the above limiting condition is taken into
consideration.

In Fig. 41, a subset is taken out from the
point set A and is denoted as SA. Further, an
element SB that is not included in the ancestors or
siblings of the tree structure is taken out from the
point set B and is denoted as d_j. When there is no
element that can be taken out, then d_j = nil.

Then, SA and d_j are examined in regard to
whether their attributes are the same or not, and when
the attributes are not the same, the combination is
discarded for pruning. When the attributes are the
same, the point sets are matched, and an r.m.s.d.
value is calculated under the optimum matching. When
this value is smaller than a predetermined threshold
value, SA and d_j are related to each other, and are
registered as child nodes of d_[...] in the tree
structure, and correspondence of an optimum point is
stored in the sequence. The above-mentioned
processing is repeated for all of the subsets.
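
The pruned tree search described above can be sketched as a depth-first enumeration. This is an illustration under stated assumptions, not the algorithm of Fig. 41 as filed: the names `correspondences` and `drmsd` are hypothetical, a nil branch is always added so that every partial correspondence is listed, and `drmsd` (the r.m.s. difference of intra-set distances) is a simple stand-in that replaces the r.m.s.d. under optimum matching; it requires subsets of equal length and cannot distinguish mirror images.

```python
import math
from typing import Callable, List, Optional, Sequence, Set, Tuple

Point = Tuple[float, float, float]
Subset = Tuple[str, List[Point]]   # (attribute, points), e.g. ("helix", [...])


def drmsd(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Stand-in score: r.m.s. difference of intra-set distances."""
    n = len(p)
    if n != len(q) or n < 2:
        return math.inf
    sq = [(math.dist(p[i], p[j]) - math.dist(q[i], q[j])) ** 2
          for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(sq) / len(sq))


def correspondences(
    sa: Sequence[Subset], sb: Sequence[Subset], threshold: float,
    score: Callable[[Sequence[Point], Sequence[Point]], float] = drmsd,
) -> List[List[Optional[int]]]:
    """List every assignment SA_k -> index of SB (or None for nil)."""
    results: List[List[Optional[int]]] = []

    def extend(k: int, used: Set[int], partial: List[Optional[int]]) -> None:
        if k == len(sa):                       # leaf node: complete correspondence
            results.append(list(partial))
            return
        attr, pts = sa[k]
        for j, (battr, bpts) in enumerate(sb):
            if j in used or battr != attr:     # prune: already used, or attributes differ
                continue
            if score(pts, bpts) < threshold:   # prune unless below the threshold
                used.add(j)
                partial.append(j)
                extend(k + 1, used, partial)
                partial.pop()
                used.discard(j)
        partial.append(None)                   # nil: SA_k has no counterpart
        extend(k + 1, used, partial)
        partial.pop()

    extend(0, set(), [])
    return results
```

Because candidates with a different attribute or a score at or above the threshold never become nodes, the tree stays far smaller than the full set of combinations.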

(5) Decision of similarity between the structure 
A and the structure B. 

Two point sets are created using elements
belonging to the subsets related in (4) above, and an
r.m.s.d. value between them is calculated in
compliance with Kabsch's method, and when the value is
smaller than the threshold value, it is decided that
the two structures are similar to each other.
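
For reference, the quantity compared with the threshold here can be written out. The following is the standard formulation of the r.m.s.d. under optimum superposition for N paired points a_i and b_i, together with the closed-form rotation given by Kabsch's method; the symbols N, R, t, H, U, Σ and V are notation introduced for this illustration and do not appear in the original text.

```latex
\begin{align}
\mathrm{r.m.s.d.} &= \min_{R,\,t}\;
  \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl\lVert R\,a_i + t - b_i \bigr\rVert^{2}},
  \qquad R^{\mathsf T}R = I,\quad \det R = 1, \\
\bar a &= \frac{1}{N}\sum_{i=1}^{N} a_i, \qquad
\bar b = \frac{1}{N}\sum_{i=1}^{N} b_i, \\
H &= \sum_{i=1}^{N} (a_i - \bar a)(b_i - \bar b)^{\mathsf T} = U\,\Sigma\,V^{\mathsf T}, \\
R &= V\,\operatorname{diag}\!\bigl(1,\,1,\,\det(V U^{\mathsf T})\bigr)\,U^{\mathsf T},
  \qquad t = \bar b - R\,\bar a .
\end{align}
```

The factor det(V U^T), which is +1 or -1, keeps R a proper rotation rather than a reflection.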

Described below is a system for retrieving 
three-dimensional structures of proteins using the 
secondary structural similarity that can be realized 
based on the above-mentioned method. 

Fig. 42 illustrates the constitution of a
retrieval system that is made up of a data base 160 to
which are registered three-dimensional structure data
of proteins, a secondary structure calculation unit
161 that determines a secondary structure from the
three-dimensional structure data in the data base 160
and divides it into partial structures, a secondary
structure coordinate table 162 that stores the results
obtained by the secondary structure calculation unit
161 as a type of the secondary structure and
three-dimensional coordinates of points that constitute
the type of the secondary structure, an input unit 163
that reads an input command of a user, a retrieving
unit 164 that retrieves a similar structure based on
the aforementioned method relying on the command that
is input and the data in the secondary structure
coordinate table, and a display unit 165 that
graphically displays the retrieved result. Details of
the units will now be described.

(a) Data base 160.

The data base stores three-dimensional
structure data of proteins. The name and the
three-dimensional coordinate data of constituent atoms
are registered for each of the proteins.

(b) Secondary structure calculation unit
161.

The secondary structure calculation unit
161 divides the structure of a protein into types of
secondary structures based on the three-dimensional
coordinates in the data base, and divides a point set
into subsets. Table I shows the types of the
secondary structures and the definitions thereof. The
type the i-th amino acid belongs to is sequentially
determined according to the definitions shown in
Table I, and subsets are created from a series of
coordinates of the amino acids belonging to the same
type. The thus determined type of the secondary
structure and the coordinate data of the constituent
amino acids are stored in the secondary structure
coordinate table 162. By repeating this operation, n
amino acids are all grouped into subsets. Fig. 43
shows a flow of process related to the determination
of the secondary structure and division into subsets.
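
The grouping step, in which consecutive amino acids of the same type form one subset, can be sketched as follows. This is a minimal illustration that assumes the type of each residue has already been determined; the hydrogen-bond tests of Table I are not implemented, and the function name `split_into_subsets` and the type labels used with it are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Subset = Tuple[str, List[Point]]   # (secondary structure type, coordinates)


def split_into_subsets(types: Sequence[str],
                       coords: Sequence[Point]) -> List[Subset]:
    """Group consecutive residues of the same type into one subset each."""
    if len(types) != len(coords):
        raise ValueError("one type label is required per residue")
    subsets: List[Subset] = []
    # groupby only merges adjacent equal keys, so sequence order is preserved.
    for sstype, run in groupby(zip(types, coords), key=lambda tc: tc[0]):
        subsets.append((sstype, [c for _, c in run]))
    return subsets
```

Each returned pair corresponds to one row group of the secondary structure coordinate table: a type and the coordinates of the amino acids that make it up.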



Table I: Types of secondary structures and their definitions

Type              Definition

3_10-Helix        Structure in which carbonyl groups of i-th
                  residues and amide groups of i+3-th residues
                  are aligned by hydrogen bonds therebetween.

α-Helix           Structure in which carbonyl groups of i-th
                  residues and amide groups of i+4-th residues
                  are aligned by hydrogen bonds therebetween.

Parallel          Structure in which hydrogen bonds are formed
β-sheet           between carbonyl groups of i-1-th residues and
                  amide groups of j-th residues and between
                  carbonyl groups of j-th residues and amide
                  groups of i+1-th residues, or hydrogen bonds
                  are formed between carbonyl groups of j-1-th
                  residues and amide groups of i-th residues and
                  between carbonyl groups of i-th residues and
                  amide groups of j+1-th residues.

3-Turn            Structure in which hydrogen bonds are formed
                  between carbonyl groups of i-th residues and
                  amide groups of i+2-th residues.


(c) Secondary structure coordinate table 162.

Fig. 44 illustrates a constitution of
the secondary structure coordinate table 162 where the
types of the secondary structures determined by the
secondary structure calculation unit 161 and the
coordinate data of amino acids constituting the
secondary structure are stored. In this example, the
subsets S1 and S2 belong to the type of α-helix and
the subsets S3, ... belong to the type of β-sheet.

(d) Input unit 163.

The input unit 163 reads the name of a
protein that serves as a retrieval key based on the
secondary structure coordinate table 162 and the input
command from the user, and sends it to the retrieving
unit 164.

(e) Retrieving unit 164. 

Fig. 45 shows the processing carried out
by the retrieving unit 164. The retrieving unit 164
reads the data stored in the secondary structure
coordinate table 162 regarding a protein that serves
as a key sent from the input unit 163, determines the
correspondence of subsets, calculates the r.m.s.d.
between the two structures, and selects the one having
an r.m.s.d. value that is smaller than the threshold
value, thereby retrieving the structure having a high
degree of similarity. The correspondence is
determined based on the aforementioned method of
determining correspondence among the subsets. In this
case, the attribute of the subsets is the type of
secondary structure. The correspondence is fixed only
when the types of the two subsets are the same and when
the r.m.s.d. value is smaller than the threshold value
when the structures are best matched.

Next, points are matched with each other
with regard to the sets constituted by points that
belong to the related subsets, and the r.m.s.d. value
of the whole structure is calculated. In the example
of Fig. 40, SA1 and SB1 are related to each other, and
SA2 and SB3 are related to each other. In this case,
matching is effected among the points belonging to the
sets (SA1, SA2) and the points belonging to the sets
(SB1, SB3), and the r.m.s.d. value is calculated.
When the r.m.s.d. value is smaller than the threshold
value, the structure is determined to have a
similarity and is registered to the retrieved result.
This operation is carried out for all of the proteins
stored in the secondary structure coordinate
table 162, and the three-dimensional structures that
are similar to each other in secondary structure are
retrieved from all of the data.

(f) Display unit 165.

Based on the results retrieved by the
retrieving unit 164, the display unit 165 displays the
names of proteins having similar structures, secondary
structures of a key protein and proteins having
similar structures, and amino acids constituting the
secondary structures.

Fig. 46 shows examples of outputs.
Figs. 47A and 47B illustrate three-dimensional
structures of a key protein A used in retrieval and a
protein B having a similar structure that is
retrieved.

In Figs. 47A and 47B, a partial
structure of α-helix is represented by a helical
ribbon, a partial structure of β-strand is represented
by an arrow, and partial structures of loop and turn
are represented by tubes. As a result, it will be
understood that the key protein is divided into four
partial structures of α-helix, β-strand, loop and
β-strand in the order of amino acid sequence, and these
partial structures correspond to subsets SA1, SA2, SA3
and SA4, respectively.

Referring to Fig. 46, subsets SA1, SA2
and SA4 in A are similar to subsets SB10, SB1 and SB3
indicated by arrows in B, and are further similar in
their relationship of spatial positions of the three
partial structures. In A, a loop portion SA3 does not
have an arrow, indicating that there is no similar
partial structure. Similar portions in the protein B
of similar structure are hatched in Fig. 47B.



